Python_How to parse data and make it into new file - python

Here is input.txt file
Jan_Feb 0.11
Jan_Mar -1.11
Jan_Apr 0.2
Feb_Jan 0.11
Feb_Mar -3.0
Mar_Jan -1.11
Mar_Feb -3.0
Mar_Apr 3.5
from this file, I am trying to create a dictionary from the input text file. 1) The keys are two values which is split with "_" from 1st column string of input file. 2) Moreover, if the name of column and row are same (such as Jan and Jan), write 0.0 as follows. 3) Lastly, if the keys are not found in the dictionary, write "NA". Output.txt
Jan Feb Mar Apr
Jan 0.0 0.11 -1.11 0.2
Feb 0.11 0.0 -3.0 NA
Mar -1.11 -3.0 0.0 3.5
Apr 0.2 NA 3.5 0.0
I would really appreciate if someone can help me figure out. Actually, there are about 100,000,000 rows * 2 columns in real input.txt. The name of Thank you so much in advance.

Others might disagree with this but one solution would be simply to read all 100 million lines into a relational database table (appropriately split-ing out what you need, of course) using a module that interfaces with MySQL or SQLite:
Your_Table:
ID
Gene_Column
Gene_Row
Value
Once they're in there, you can query against the table in something that resembles English:
Get all of the column headings:
select distinct Gene_Column from Your_Table order by Gene_Column asc
Get all of the values for a particular row, and which columns they're in:
select Gene_Column, Value from Your_Table where Gene_Row = "Some_Name"
Get the value for a particular cell:
select Value from Your_Table where Gene_Row = "Some_Name" and Gene_Column = "Another_Name"
That, and you really don't want to shuffle around 100 million records any more than you have to. Reading all of them into memory may be problematic as well. Doing it this way, you can construct your matrix one row at a time, and output the row to your file.
It might not be the fastest, but it will probably be pretty clear and straightforward code.

What you first need to do is get the data in an understandable format. So, first, you need to create a row. I would get the data like so:
with open('test.txt') as f:
data = [(l.split()[0].split('_'), l.split()[1]) for l in f]
# Example:
# [(['Jan', 'Feb'], '0.11'), (['Jan', 'Mar'], '-1.11'), (['Jan', 'Apr'], '0.2'), (['Feb', 'Jan'], '0.11'), (['Feb', 'Mar'], '-3.0'), (['Mar', 'Jan'], '-1.11'), (['Mar', 'Feb'], '-3.0'), (['Mar', 'Apr'], '3.5')]
headers = set([var[0][0] for var in data] + [var[0][1] for var in data])
# Example:
# set(['Jan', 'Apr', 'Mar', 'Feb'])
What you then need to do is create a mapping from your headers to your values, which are stored in data. Ideally, you would need to create a table. Take a look at this answer to help you figure out how to do that (we can't write your code for you).
Secondly, in order to print things out properly, you will need to use the format method. Ideally, it will help you deal with strings, and printing them in a specific fashion.
After that you can simply write like so with open('output.txt', 'w').

matrix = dict()
with open('inpu.txt') as f:
content = f.read()
tmps = content.split('\n')
for tmp in tmps:
s = tmp.split(' ')
latter = s[0].split('_')
try:
if latter[0] in matrix:
matrix[latter[0]][latter[1]] = s[1]
else:
matrix[latter[0]] = dict()
matrix[latter[0]][latter[1]] = s[1]
except:
pass
print matrix
And now in matrix you have table what u want.

If you want a dictionary as a result, something like that:
dico = {}
keyset=set()
with open('input.txt','r') as file:
line = file.readline()
keys = line.split('\t')[0]
value = line.split('\t')[1]
key1 = keys.split('_')[0]
keyset.add(key1)
key2 = keys.split('_')[1]
keyset.add(key2)
if key1 not in dico:
dico[key1] = {}
dico[key1][key2] = value
for key in keyset:
dico[key][key] = 0.0
for secondkey in keyset:
if secondkey not in dico[key].keys():
dico[key][secondkey]="NA"

1) Determine all of the possible headers for resulting column/row. In your example, that is A-D. How you do this can vary. You can parse the file 2x (not ideal, but it might be necessary), or perhaps you have someplace you can refer to for the distinct columns.
2) Establish headers. In the example above, you would have headers=["A","B","C","D"]. You can build this up during #1 if you have to parse the first column. Use len(indexes) to determine N
3) Parse the data, this time consider both columns. You will get the two keys using .split("_") on the first column, then you get the index for your data by doing simple arithmetic:
x,y = [headers.index(a) for a in row[0].split("_")]
data[x+y*len(headers)] = row[1]
This should be relatively fast, except for the parsing of the file twice. If it can fit into memory, you could load your file into memory and then scan over it twice, or use command line tricks to establish those header entries.
-- I should say that you will need to determine N before you begin storing the actual data. (i.e. data=[0]*N). Also, you'll need to use x+y*len(headers) during the save as well. If you are using numpy, you can use reshape to get an actual row/col layout which will be a little easier to manipulate and print (i.e. data[x,y]=row[1])
If you do a lot of large data manipulation, especially if you might be performing calculations, you really should look into learning numpy (www.numpy.org).

Given the size of your input, I would split this in several passes on your file:
First pass to identify the headers (loop on all lines, read the first group, find the headers)
Read the file again to find the values to put in the matrix. There are several options.
The simplest (but slow) option is to read the whole file for each line of your matrix, identifying only the lines you need in the file. This way, you only have one line of the matrix in memory at a time
From your example file, it seems your input file is properly sorted. If it is not, it can be a good idea to sort it. This way, you know that the line in the input file you are reading is the next cell in your matrix (except of course for the 0 diagonal which you only need to add.)

Related

Finding Min and Max Values in List Pattern in Python

I have a csv file and the data pattern like this:
I am importing it from csv file. In input data, there are some whitespaces and I am handling it by using pattern as above. For output, I want to write a function that takes this file as an input and prints the lowest and highest blood pressure. Also, it will return average of all mean values. On the other side, I should not use pandas.
I wrote below code blog.
bloods=open(bloodfilename).read().split("\n")
blood_pressure=bloods[4].split(",")[1]
pattern=r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
re.findall(pattern,blood_pressure)
#now extract mean, min and max information from the blood_pressure of each patinet and write a new file called blood_pressure_modified.csv
pattern=r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
outputfilename="blood_pressure_modified.csv"
# create a writeable file
outputfile=open(outputfilename,"w")
for blood in bloods:
patient_id, blood_pressure=bloods.strip.split(",")
mean=re.findall(pattern,blood_pressure)[0]
blood_pressure_modified=re.sub(pattern,"",blood_pressure)
print(patient_id, blood_pressure_modified, mean, sep=",", file=outputfile)
outputfile.close()
Output should looks like this:
This is a very simple kind of answer to this. No regex, pandas or anything.
Let me know if this is working. I can try making it work better for any case it doesn't work.
bloods=open("bloodfilename.csv").read().split("\n")
means = []
'''
Also, rather than having two list of mins and maxs,
we can have just one and taking min and max from this
list later would do the same trick. But just for clarity I kept both.
'''
mins = []
maxs = []
for val in bloods[1:]: #assuming first line is header of the csv
mean, ranges = val.split(',')[1].split('[')
means.append(int(mean.strip()))
first, second = ranges.split(']')[0].split('-')
mins.append(int(first.strip()))
maxs.append(int(second.strip()))
print(f'the lowest and the highest blood pressure are: {min(mins)} {max(maxs)} respectively\naverage of mean values is {sum(means)/len(means)}')
You can also create functions to perform small small strip stuff. That's usually a better way to code. I wrote this in bit hurry, so don't mind.
Maybe this could help with your question,
Suppose you have a CSV file like this, and want to extract only the min and max values,
SN Number
1 135[113-166]
2 140[110-155]
3 132[108-180]
4 40[130-178]
5 133[118-160]
Then,
import pandas as pd
df = pd.read_csv("REPLACE_WITH_YOUR_FILE_NAME.csv")
results = df['YOUR_NUMBER_COLUMN'].apply(lambda x: x.split("[")[1].strip("]").split("-"))
with open("results.csv", mode="w") as f:
f.write("MIN," + "MAX")
f.write("\n")
for i in results:
f.write(str(i[0]) + "," + str(i[1]))
f.write("\n")
f.close()
After you ran the snippet after without any errors then in your current working directory their should be a file named results.csv Open it up and you will have the results

read multi-line list from file

I have a file with data like:
POTENTIAL
TYPE 1
-5.19998150116627E+07 -5.09571848744513E+07 -4.99354600752570E+07 -4.89342214499422E+07 -4.79530582388520E+07
-4.69915679183017E+07 -4.60493560354389E+07 -4.51260360464197E+07 -4.42212291578282E+07 -4.33345641712756E+07
-4.24656773311163E+07 -4.16142121752159E+07 -4.07798193887125E+07 -3.99621566607090E+07 -3.91608885438409E+07
-3.83756863166569E+07
-8.99995987594328E+07 -8.81884626368405E+07 -8.64137733336537E+07 -8.46747974037847E+07 -8.29708161608188E+07
-8.13011253809965E+07 -7.96650350121689E+07 -7.80618688886128E+07 -7.64909644515842E+07 -7.49516724754953E+07
-7.34433567996002E+07 -7.19653940650832E+07 -7.05171734574350E+07 -6.90980964540154E+07 -6.77075765766936E+07
-6.63450391494693E+07
Note as per Nsh's comment these data are not single line. They always have 5 data per line, and as per this example, 4 row, with only one data in 4th row. So, I have 16 float spread over 4 line. I always know the total number (i.e. 16 in this case)
My aim is to read them as a list (please let me know if there is better things). The row with the single entry denotes end of a list (e.g. the list[1] ends with -3.83756863166569E+07).
I tried to read it as:
if line.startswith("POTENTIAL"):
lines = f.readline()
if lines.startswith("TYPE "):
lines=f.readline()
lines=lines.split()
lines = [float(i) for i in lines]
pots.append(lines)
print(pots)
which gives result:
[[-51999815.0116627, -50957184.8744513, -49935460.075257, -48934221.4499422, -47953058.238852]]
i.e. just the first line from the list, and not going any further.
My aim is to get them as different list (possibly) as:
pots[1]=[-5.19998150116627E+07....-3.83756863166569E+07]
pots[2]=[-8.99995987594328E+07....-6.63450391494693E+07]
I have read searched google extensively (the present state itself is from another SO question), but due to my inexperience, I cant solve my problem.
Kindly help.
use + instead of append.
It will append the elements of lines to pots.
pots = pots + lines
I didn't see in the start:
pots = []
It is needed in this case...
ITEMS_PER_LIST = 16
lists = [[]] # list of lists with initialized first sublist
with open('data.txt') as f:
for line in f:
if line.startswith(("POTENTIAL", "TYPE")):
continue
if len(lists[-1]) == ITEMS_PER_LIST:
lists.append([]) # create new list
lists[-1].extend([float(i) for i in line.split()])
Additional tweaks are required to validate headers.

Compare all the CSV files in a folder and print duplicate rows

I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))
What about something like:
def duplines(csv_least_cols, csv_most_cols):
rowset = set()
with open(csv_least_cols) as csv1:
r = csv.reader(csv1)
csv1_cols = next(r)
for row in r:
rowset.add(tuple(row))
with open(csv_most_cols) as csv2:
dr = csv.DictReader(csv2)
for drow in dr:
refcols = tuple(drow[c] for c in csv1_cols)
if refcols in rowset: yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
out_start = '{},{} ('.format(csv_least, csv_most)
for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
out_middle = []
for c, v in zip(c1cols, refvals):
out_middle.append('{}:{}'.format(c, v))
out_end = []
for c in c2dict:
if c in c1cols: continue
out_end.append('{}:{}'.format(c, c2dict[c]))
out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = []
new = []
for row in fOld:
old.append(row)
for row in fNew:
new.append(row)
output = []
x = len(new)
i = 0
num = 0
while i < x:
if new[num] not in old:
fNewUpdate.writerow(new[num])
num += 1
i += 1
fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
fileAin.close()
fileBin.close()
output = {}
for row_id in new:
if row_id not in old or not old[row_id] == new[row_id]:
output[row_id] = new[row_id]
for row_id in output:
fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.
This is a little bit of a naive implementation but will deal with unsorted data:
import csv
file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
for row in csv.reader(handle):
file1_dict[tuple(row[:2])] = row[2:]
with open('file2.csv') as handle:
for row in csv.reader(handle):
file2_dict[tuple(row[:2])] = row[2:]
with open('outfile.csv', 'w') as handle:
writer = csv.writer(handle)
for key, val in file1_dict.iteritems():
if key in file2_dict:
#deal with keys that are in both
if file2_dict[key] == val:
writer.writerow(key+val+('Same',))
else:
writer.writerow(key+file2_dict[key]+('Modified',))
file2_dict.pop(key)
else:
writer.writerow(key+val+('Removed',))
#deal with added keys!
for key, val in file2_dict.iteritems():
writer.writerow(key+val+('Added',))
You probably won't be able to "drop in" this solution but it should get you ~95% of the way there. #S.Lott is right, 2 300mb files will easily fit in memory ... if your files get into the 1-2gb range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added a modified to make sense:
#assumming both files are sorted by columns 1 and 2
import datetime
from itertools import imap
def str2date(in):
return datetime.date(*map(int,in.split('-')))
def convert_tups(row):
key = (row[0], str2date(row[1]))
val = tuple(row[2:])
return key, val
with open('file1.csv') as handle1:
with open('file2.csv') as handle2:
with open('outfile.csv', 'w') as outhandle:
writer = csv.writer(outhandle)
gen1 = imap(convert_tups, csv.reader(handle1))
gen2 = imap(convert_tups, csv.reader(handle2))
gen2key, gen2val = gen2.next()
for gen1key, gen1val in gen1:
if gen1key == gen2key and gen1val == gen2val:
writer.writerow(gen1key+gen1val+('Same',))
gen2key, gen2val = gen2.next()
elif gen1key == gen2key and gen1val != gen2val:
writer.writerow(gen2key+gen2val+('Modified',))
gen2key, gen2val = gen2.next()
elif gen1key > gen2key:
while gen1key>gen2key:
writer.writerow(gen2key+gen2val+('Added',))
gen2key, gen2val = gen2.next()
else:
writer.writerow(gen1key+gen1val+('Removed',))

Categories

Resources