I am reading a CSV file with Python, but I only want the last valid row for a given address, searching from the tail of the file. The catch is that the function should return the entire row only when that row is valid. Can anyone help me with this?
My CSV file looks like this:
Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
The code I am trying is:
import struct
from collections import deque

def last_data(address):
    i = sum(1 for line in open("filename.csv", 'r'))
    print(i) # number of lines in csv
    cache = {} # dict that saved the last data for particular address
    n = 3
    with open("filename.csv",'r') as f:
        q = deque(f, 3) # 3 lines read at the end
        qp = [''] * n
        if i +1 >= n: # for checking whether the number of lines greater than number of add.
            for k in range(n):
                qp[k] = q[k].split(',')
                if address == str(qp[k][1]): # check for particular address in row
                    # if the row has data than put it into json object with address as key and nested key as columns 'A', 'B', etc.
                    cache.update({address: {'A':struct.pack('>l',int(float(qp[k][3]) * 10)),
                                            'C':struct.pack('>l',int(float(qp[k][4]) * 10))
                                            }})
    return cache[address]['A'], cache[address]['C']
last_data('0013A20040DC88C3') returns row 5, which holds invalid data, whereas I want row 2. Can anybody tell me how to do this?
With pandas it would look like this:
Note: this is Python 2.7 code. On Python 3, import StringIO from the io module instead (from io import StringIO).
import pandas as pd
from StringIO import StringIO
input = """Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
"""
buffer = StringIO(input)
df = pd.read_csv(buffer, sep=r"\s+", na_values=["--"])
# you can customize the behaviour here, e.g. how many invalid values are ok per row.
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
df = df.dropna()
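The snippet above only drops the invalid rows; picking the last valid row for one address is a separate step. A minimal standard-library sketch of that step follows, assuming whitespace-separated columns as in the sample and treating any `--` field as invalid. The names `last_valid_rows` and `SAMPLE` are made up for illustration, and the code is written for Python 3.

```python
import io

SAMPLE = """Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
"""

def last_valid_rows(lines):
    """Return {address: row dict} holding the last fully valid row per address."""
    header = next(lines).split()
    latest = {}
    for line in lines:
        fields = line.split()
        if len(fields) != len(header) or "--" in fields:
            continue  # skip blank, short, or invalid rows
        row = dict(zip(header, fields))
        latest[row["Add"]] = row  # later rows overwrite earlier ones
    return latest

rows = last_valid_rows(io.StringIO(SAMPLE))
row = rows["0013A20040DC88C3"]
print(row["Sr."], row["A"], row["C"])
```

This reads the file once from the top. A fixed-size tail read, as in the question, cannot guarantee a hit, because the last valid row for an address may be arbitrarily far from the end of the file.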
With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The structure of the csv is:
The working code is:
import pandas as pd
from pylanguagetool import api
df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)
for i, x in enumerate(range(length1)):
    # this is the pylanguagetool operation
    errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    # this pulls the error count "message" from the pylanguagetool json
    error_count = result.count("message")
    output_df = pd.DataFrame({"error_count": [error_count]})
    output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output is:
Expected output:
What changes are necessary to append the output like this?
Instead of using a loop, you might consider apply with a lambda, which accomplishes what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
user_id ... error_count
0 10 ... 2
1 11 ... 0
2 12 ... 0
3 13 ... 0
4 14 ... 0
5 15 ... 2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a", as that opens the file in append mode and adds rows to the end of the existing file, whereas you want the default write mode.
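To see the effect without calling the LanguageTool service, here is a small sketch in which a hypothetical `count_errors` function stands in for `api.check`, and the frame is written to an in-memory buffer rather than to `stack.csv`. Both stand-ins, and the sample data, are assumptions made so that the example runs offline.

```python
import io

import pandas as pd

def count_errors(text):
    # Hypothetical stand-in for the real grammar check: counts doubled spaces.
    return text.count("  ")

df = pd.DataFrame({"user_id": [10, 11, 12],
                   "text": ["This  is  bad", None, "This is fine"]})

# New column with one value per existing row.
df["error_count"] = df["text"].fillna("").apply(count_errors)

# Default write mode: the whole table, including the new column, is written once.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the column is attached to the DataFrame before writing, the counts end up next to the rows they belong to instead of below them.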
My strategy would be to collect the error counts in a list, add that list as a new column of the original DataFrame, and finally write the DataFrame to CSV:
text_data = df["text"].fillna('')
error_count_lst = []
for text in text_data:
    # check one text at a time, not the whole column
    errors = api.check(text, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    error_count = result.count("message")
    error_count_lst.append(error_count)
# assign to the DataFrame, not to the text Series, so that a real column is created
df['error_count'] = error_count_lst
df.to_csv('file.csv', index=False)
I have a data frame with 384 rows (plus one additional dummy row at the beginning).
Each row has 4 variables that I entered manually, 3 calculated fields based on those 4 variables,
and 3 fields that compare each calculated variable to the previous row. Each of these 6 fields can take one of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6*6=384).
Each iteration builds a frequency table (pivot), and if the count of any combination differs from 6 it breaks and re-randomizes the order.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random
# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"
# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right-DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)
again = 1
# main loop to try and find optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [[random.randint(low=1, high=100000)] for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row the first one is a dummy and when sorted needs to stay first
    df.rand.loc[0] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break
Sorted.to_csv("Edos.csv", sep="\t",index=False)
print("bye")
the data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F
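The acceptance test inside the loop (every combination of the six binary fields appears exactly six times) can be written as a small standalone function. The sketch below uses `collections.Counter` instead of the pandas `groupby` from the script; `is_balanced` and the toy data are illustrative names, not part of the original code, and the function only checks a given ordering rather than searching for one.

```python
from collections import Counter
from itertools import product

def is_balanced(rows, n_fields=6, per_combo=6):
    """True if every combination of n_fields binary values occurs per_combo times."""
    counts = Counter(tuple(row) for row in rows)
    return (len(counts) == 2 ** n_fields
            and all(c == per_combo for c in counts.values()))

# Toy data: all 64 combinations of six True/False fields, six copies each.
balanced = [combo for combo in product([True, False], repeat=6) for _ in range(6)]
print(len(balanced), is_balanced(balanced))

# Flipping one field in one row breaks the balance.
unbalanced = [list(r) for r in balanced]
unbalanced[0][0] = not unbalanced[0][0]
print(is_balanced(unbalanced))
```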
I have a csv file that has a primary_id field and a version field and it looks like this:
ful_id version xs at_grade date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
Edit: this is what the actual data looks like, plus 106 more columns of data and 20,000 records.
The larger version number is the latest version of that record. I am having a difficult time working out the logic to get the latest record based on version and dump it into a dictionary. So far I am pulling the rows from the csv into an empty list. If anyone could give me some guidance on the logic moving forward, I would appreciate it.
import csv
import pprint
from collections import defaultdict
reader = csv.DictReader(open('rpm_inv.csv', 'rb'))
allData = list(reader)
dict_list = []
for line in allData:
    dict_list.append(line)
pprint.pprint(dict_list)
I'm not exactly sure what you want your output to look like, but this might at least point you in the right direction, as long as you're not opposed to pandas.
import pandas as pd
# the first line of the file is used as the header by default
df = pd.read_csv('rpm_inv.csv')
# highest version number seen for each record id
latest = df.groupby('ful_id')['version'].max()
# To put it into a dictionary of {ful_id: latest version}
latest.to_dict()
There's no need for anything fancy.
defaultdict is included in Python's standard library. It is a dictionary that creates missing entries on demand. I've used it here because it removes the need to initialise entries before using them. This means that I can write, for instance, result[id] = max(result[id], version). If no entry exists for id, then defaultdict(int) creates one with the value 0, so the max simply stores version (assuming version numbers are positive).
I read through the lines in the input file one at a time, stripping line endings and blanks, splitting on the commas, and then use map to apply the int function to each string produced.
I ignore the first line in the file simply by reading it and assigning its contents to a variable that I have arbitrarily called ignore.
Finally, just to make the results more intelligible, I sort the keys in the dictionary and present its contents in order.
>>> from collections import defaultdict
>>> result = defaultdict(int)
>>> with open('to_dict.txt') as input:
...     ignore = input.readline()
...     for line in input:
...         id, version = map(int, line.strip().replace(' ', '').split(','))
...         result[id] = max(result[id], version)
...
>>> ids = list(result.keys())
>>> ids.sort()
>>> for id in ids:
...     id, result[id]
...
(3, 1)
(11, 3)
(20, 2)
(400, 2)
EDIT: With that much data it becomes a different question, in my estimation, one better handled with pandas.
I've included the df.groupby(['ful_id']).version.idxmax() step on its own to demonstrate what it does. I group on ful_id, then ask for the index of the row holding the maximum version in each group, all in one step using idxmax. Although pandas displays this as a two-column table, the result is actually a Series of integer index labels that I can use to select rows from the dataframe.
That's what I do with df.iloc[df.groupby(['ful_id']).version.idxmax(),:]. Here the df.groupby(['ful_id']).version.idxmax() part identifies the rows, and the : part identifies the columns, namely all of them. Note that idxmax returns index labels; with the default integer index these coincide with row positions, which is why iloc works here, while df.loc is the safer choice for any other index.
Thanks for an interesting question!
>>> import pandas as pd
>>> df = pd.read_csv('different.csv', sep=r'\s+')
>>> df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
1 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
4 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
>>> df.groupby(['ful_id']).version.idxmax()
ful_id
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 0
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 3
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 2
Name: version, dtype: int64
>>> new_df = df.iloc[df.groupby(['ful_id']).version.idxmax(),:]
>>> new_df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
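Since the question asks for a dictionary, the same "keep the row with the highest version" logic can also be sketched with `csv.DictReader` alone. This is a minimal illustration that assumes a comma-separated file with a header row; `latest_records` and the in-memory `SAMPLE` are hypothetical names standing in for the real file.

```python
import csv
import io

SAMPLE = """ful_id,version,xs,at_grade,date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980,3,123,yes,20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980,1,12,no,20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342,1,334,yes,20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9,1,556,yes,20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980,2,123,no,20170206
"""

def latest_records(lines):
    """Return {ful_id: row dict}, keeping the row with the highest version."""
    latest = {}
    for row in csv.DictReader(lines):
        key = row["ful_id"]
        # Compare versions as integers so that 10 sorts after 9.
        if key not in latest or int(row["version"]) > int(latest[key]["version"]):
            latest[key] = row
    return latest

latest = latest_records(io.StringIO(SAMPLE))
for ful_id, row in sorted(latest.items()):
    print(ful_id, row["version"], row["date"])
```

Each value in `latest` is the complete row, so the other 106 columns would come along without further changes.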
I need to read a variable number of columns from my input file (the number of columns is defined by the user, with no upper limit). For every column I have several variables to read, three in my case, also set by the user.
So the file to read is like:
2 3 5
6 7 9
3 6 8
In Fortran this is really easy to do:
DO 180 I=1,NMOD
READ(10,*) QARR(I),RARR(I),WARR(I)
NMOD is defined by the user, as are all the values in the example. All of them are input parameters to be stored in memory. By doing this I can save all the variables I need and use them whenever I want, recalling them by changing the index I. How can I obtain the same result with Python?
Example file 'text'
2 3 5
6 7 9
3 6 8
Python code
data = []
with open('text') as file:
    columns_to_read = 1  # here you tell how many columns you want to read per line
    for line in file:
        data.append(list(map(int, line.split()[:columns_to_read])))
print(data)  # prints: [[2], [6], [3]]
data will hold a list of lists, one inner list per line of the file.
from itertools import islice
with open('file.txt', 'rt') as f:
    # default slice from row 0 until end with step 1
    # example: islice(f, 10, 20, 2) takes only rows 10, 12, 14, 16, 18
    dat = islice(f, 0, None, 1)
    column = None  # change column here, default to all
    # this keeps the list values as strings
    # mylist = [i.split() for i in dat]
    # this keeps the list values as ints
    mylist = [[int(j) for j in i.split()[:column]] for i in dat]
The code above constructs a 2D list.
Access elements with mylist[row][column].
Example: mylist[2][1] accesses row 2, column 1 (both indices are zero-based).
Edit: improved code efficiency following Guillaume's and Javier's suggestions.
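To mirror the Fortran loop more literally, with one array per variable indexed by I, the rows can be transposed into three parallel lists. The following is a minimal sketch; `read_columns` is a made-up helper, the data is read from an in-memory string instead of unit 10, and it assumes `nmod` is at least 1 and that every line holds exactly three integers.

```python
import io

SAMPLE = "2 3 5\n6 7 9\n3 6 8\n"

def read_columns(lines, nmod):
    """Read nmod rows of three integers and return one list per column."""
    rows = []
    for _ in range(nmod):
        q, r, w = (int(tok) for tok in next(lines).split())
        rows.append((q, r, w))
    # zip(*rows) transposes the list of rows into columns.
    qarr, rarr, warr = (list(col) for col in zip(*rows))
    return qarr, rarr, warr

qarr, rarr, warr = read_columns(io.StringIO(SAMPLE), nmod=3)
print(qarr, rarr, warr)
```

Python lists are zero-based, so `qarr[0]` plays the role of `QARR(1)` in the Fortran version.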
I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is reorganize/sort this file based on the identifier in column 4. The file consists of blocks. Concatenating columns 4, 1, 2 and 3 gives the unique identifier of each block. This is the key of the dictionary all_exons, and the value is a numpy array containing all the values of column 8 for that block. A second dictionary, unique_identifiers, has as keys the gene names taken from column 4 and as values the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np
def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: set of exons
    all_exons = {} # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier="$".join([sp_line[3],sp_line[0],sp_line[1],sp_line[2]])
            if(identifier in all_exons):
                item = float(sp_line[7])
                all_exons[identifier]=np.append(all_exons[identifier],item)
            else:
                all_exons[identifier] = np.array([sp_line[7]],float)
            if(current_id in unique_identifiers):
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] =set([identifier])
    return unique_identifiers, all_exons
identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()
How can I improve the above code so that it runs faster?
You also import pandas, therefore, I provide a pandas solution which requires basically only two lines of code.
However, I do not know how it performs on large data sets and whether that is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df=pd.read_csv("table.txt", header=None, sep = r"\s+") # replace the separator by e.g. '\t' if the file is tab-separated
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and redirect it to a file:
for k, v in op.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7,9]
op['gene4$stuff'] = [5,9]
# print using 'sorted'
for k, v in sorted(op.items()):
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you
gene1
0
1
3
4
gene2
0
2
I'll try to simplify your problem. My solution works like this:
First, scan over the big file. For every distinct current_id, open a temporary file and append the values of column 8 to that file.
After the full scan, concatenate all chunks into one result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess
class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}
    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            # text mode, so the same code works on Python 2 and Python 3
            self.opened_files[current_id] = tempfile.NamedTemporaryFile(mode='w', delete=False)
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)
    def cat_result(self, filename):
        """Concatenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        for chunk in self.opened_files.values():
            chunk.flush()
        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)
    def clean_up(self):
        # close every chunk before deleting it
        for chunk in self.opened_files.values():
            chunk.close()
            os.unlink(chunk.name)
def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()
if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
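One cost in the question's code that is easy to remove is `np.append`, which copies the whole array on every call. A minimal sketch that collects the values in plain Python lists instead is shown below, keeping the question's output order (genes sorted, then block identifiers sorted). The helper names and the in-memory sample are illustrative, fields are split on any whitespace rather than on tabs, and whether this fits in memory for 1.5 billion lines is untested.

```python
import io
from collections import defaultdict

SAMPLE = """1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
"""

def group_blocks(lines):
    """Map gene -> {block identifier -> list of column-8 values (as strings)}."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        gene = fields[3].split("$")[0]
        block = "$".join([fields[3], fields[0], fields[1], fields[2]])
        genes[gene][block].append(fields[7])  # list.append does not copy the list
    return genes

def write_blocks(genes, out):
    """Write one '>gene' header per gene, followed by its values block by block."""
    for gene in sorted(genes):
        out.write(">" + gene + "\n")
        for block in sorted(genes[gene]):
            out.write("\n".join(genes[gene][block]) + "\n")

out = io.StringIO()
write_blocks(group_blocks(io.StringIO(SAMPLE)), out)
result = out.getvalue()
print(result)
```

The values are kept as strings because they are only written back out; converting them to float is needed only if they are used numerically.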