check csv every 5 rows with condition using python3.x - python

csv data:
>c1,v1,c2,v2,Time
>13.9,412.1,29.7,177.2,14:42:01
>13.9,412.1,29.7,177.2,14:42:02
>13.9,412.1,29.7,177.2,14:42:03
>13.9,412.1,29.7,177.2,14:42:04
>13.9,412.1,29.7,177.2,14:42:05
>0.1,415.1,1.3,-0.9,14:42:06
>0.1,408.5,1.2,-0.9,14:42:07
>13.9,412.1,29.7,177.2,14:42:08
>0.1,413.4,1.3,-0.9,14:42:09
>0.1,413.8,1.3,-0.9,14:42:10
My current code that I have:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is to read every 5 secs of data, and if 1 out of 5 secs of data of c1 is 10.0 and above, replace that value of c1 with 0.
I'm still new to python and I could not find examples for this. May I have some assistance as this problem is way beyond my python programming skills for now. Thank you!

I don't know the modules around csv files so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1, check the value and modify it.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each data line (the header on the first line is skipped), isolate
    # the fields, check the first one and modify it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index, which increments through lines by 5.
    for i in range(1, len(lines)):
        fields = lines[i].split(',')  # split comma-separated values
        # Check condition and apply needed change.
        if float(fields[0]) >= 10:
            fields[0] = "0"  # Directly as a string.
        # Transform the list back into a single string and store it.
        lines[i] = ",".join(fields)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()  # the new content is shorter, so drop the leftover bytes
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do :
from io import StringIO  # pandas needs a file-like object instead of a plain string.
# Above code here, but without the file-rewriting part at the end.
Data = pd.read_csv(
    StringIO("".join(lines)),  # each line already ends with a newline
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules for that kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think that it's often much easier than learning all of the external modules I'll be using in my life if I don't try to understand how the text representation of files can be used. Your opinion might differ.
Also, this code might result in performances being lower, as we need to iterate through the text twice (pandas does it when reading). However, I don't think you'd get faster result by reading the csv like you already do, then iterate through data anyway to check condition. (You might win a cast per c1 checked value, but the difference is small and iterating through pandas dataframe might as well be slower than a list, depending on the state of their current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could completely do it manually, it would take only a few more lines (or not, tbh) and shouldn't be slower, as the amount of iterations would be minimized : you could check conditions on data at the same time as you read it. It's getting late and I'm sure you can figure that out by yourself so I won't code it in my great editor (known as stackoverflow), ask if there's anything !
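For what it's worth, the fully manual route mentioned above could look roughly like this. It is only a sketch under assumptions: the function name zero_high_c1 and the in-memory sample are made up for illustration, the column is assumed to be called c1 as in the sample data, and every row is checked (not blocks of five), since it isn't clear which of the two was wanted.

```python
import csv
import io


def zero_high_c1(source, threshold=10.0):
    """Read CSV rows from a file-like object and zero out large c1 values.

    The check happens while reading, so the data is only iterated once.
    """
    rows = []
    for row in csv.DictReader(source):
        if float(row["c1"]) >= threshold:
            row["c1"] = "0"  # keep it as a string, like the other fields
        rows.append(row)
    return rows


# Two rows taken from the sample data in the question.
sample = io.StringIO(
    "c1,v1,c2,v2,Time\n"
    "13.9,412.1,29.7,177.2,14:42:05\n"
    "0.1,415.1,1.3,-0.9,14:42:06\n"
)
rows = zero_high_c1(sample)
print([row["c1"] for row in rows])
```

With a real file you would pass the object returned by open('filedata.csv', newline='') instead of the StringIO sample.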

Related

Optimize processing of large CSV file Python

I have a CSV file of about 175 million lines (2.86 GB), composed of three columns as shown below:
I need to get the value in column "val" given "ID1" and "ID2". I query this dataframe constantly with varying combination of ID1 and ID2, which are unique in the whole file.
I have tried to use pandas as shown below, but results are taking a lot of time.
def is_av(Qterm, Cterm, df):
    try:
        return df.loc[(Qterm, Cterm),'val']
    except KeyError:
        return 0
Is there a faster way to access CSV values, knowing that this value is located in one single row of the whole file.
If not could you check this function and tell me what might be the issue of slow processing
for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni,nc,df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.
The absolute best way of doing this would be to use SQL, consider MySQL or PostgreSQL for starters (both free and very efficient alternatives for your current use case). While Pandas is an incredibly strong library, when it comes to indexing and quick reading, this is not something it excels at, given that it needs to either load data into memory, or stream over the data with little control compared to a DBMS.
Consider your use case where you have multiple values and you want to skip specific rows, let's say you're looking for (ID1, ID2) with values of (3108, 4813). You want to skip over every row that starts with anything other than 3, then anything other than 31, and so on, and then skip any row starting with anything other than 3108,4 (assuming your csv delimiter is a comma), and so on until you get exactly the ID1 and ID2 you're looking for, this is reading the data at a character level.
Pandas does not allow you to do this (as far as I know, someone can correct this response if it does). The other example uses Dask, which is a library designed by default to handle data much larger than the RAM at scale, but is not optimized for index management as DBMS's are. Don't get me wrong, Dask is good, but not for your use case.
Another very basic alternative would be to index your data based on ID1 and ID2, store them indexed, and only look up your data through actual file reading by skipping lines that do not start with your designated ID1, and then skipping lines that do not start with your ID2, and so on, however, the best practice would be to use a DBMS, as caching, read optimization, among many other serious pros would be available; reducing the I/O read time from your disk.
You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/
You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/
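To give an idea of what an indexed lookup looks like, here is a minimal sketch. Note that it swaps in SQLite through Python's built-in sqlite3 module rather than MySQL or PostgreSQL, only because it runs without a server; the table name pairs, the column names and the three sample rows are assumptions for illustration. The composite primary key on (id1, id2) is what gives the database an index to search instead of scanning every row.

```python
import sqlite3

# In-memory database for the demo; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pairs (id1 TEXT, id2 TEXT, val REAL, PRIMARY KEY (id1, id2))"
)
# Made-up rows standing in for the 175 million line CSV.
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?)",
    [("a", "b", 0.5), ("a", "c", 1.5), ("d", "b", 2.0)],
)


def is_av(conn, id1, id2):
    """Return val for the (id1, id2) pair, or 0 when the pair is absent."""
    row = conn.execute(
        "SELECT val FROM pairs WHERE id1 = ? AND id2 = ?", (id1, id2)
    ).fetchone()
    return row[0] if row else 0


print(is_av(conn, "a", "c"))
print(is_av(conn, "x", "y"))
```

The same query shape works in MySQL or PostgreSQL; only the connection library and the placeholder syntax differ.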
# Install Dask first, e.g. with: pip install dask
import dask.dataframe as dd
dd_data = dd.read_csv('sample.csv')
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')
dd_result = dd_data[bool_filter_conditions][['val']]
dd_output = dd_result.compute()
print(dd_output)

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list, which will always have at least 1 entry (the full string, when the separator is not found). Using index may throw an exception.
You can also limit the # of splits that will happen if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
Ok then, notice that you can use indexing for strings just like you do for lists. I.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot we could just get what you want like that. Strings have the index method for exactly that. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that you could do it for each line. Split that line at , and put all in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        if "." not in line:
            continue  # e.g. the empty string after the last newline
        csv_line = line[:line.index(".")]
        for j,col in enumerate(csv_line.split(",")):
            data[(i,j)] = col
How one would do this
Notice that most people would not want to do it by hand. It is a common task to work on tabled data and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with python before you dive into pandas though. I think a good point to start is this. Using pandas your task would look like this
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.

Calculate the rotational period automatically in python

I am beginner in programming and I use python. I have a code that calculate the rotation period of star, but I have to change the star ID each time, which will take me effort and time to complete it.
Can I change the star ID automatically?
from lightkurve import search_lightcurvefile
lcf = search_lightcurvefile('201691589').download() ## star Id = 201691589
lc = lcf.PDCSAP_FLUX.remove_nans()
pg = lc.to_periodogram()
Prot= pg.frequency_at_max_power**-1
print Prot
I saved all 'stars_ID' that I want to use in a txt file(starID.txt) with 10000 lines, and I want to calculate the rotation period (Prot) in an automatic way so that the code takes the star ID from the txt file one by one and do the calculations, then save the star_ID and Prot in a csv file (two columns: 'star_ID', 'Prot'). Can you please help me do it.
This should get you close but I don't have a bunch of star IDs handy, nor is this in my field.
The main points:
Use the csv module for reading and writing files.
When you have code that you need to call many times (well, oftentimes even just once for a logical grouping), you want to consider packaging it into a function
There are other pointers for you to research. If I didn't try and make things a little more succinct than basic loops then the code would be quite long, and I tried to not make it too terse. Hopefully it's enough for you to follow up on.
from lightkurve import search_lightcurvefile
import csv
# You need to read the file and get the star IDs. I'm taking a guess here that
# the file has a single column of IDs
with open('starID.txt') as infile:
    reader = csv.reader(infile)
    # Below is where my guess matters. A "list comprehension" assuming a single
    # column so I just take the first value of each row (blank lines are skipped).
    star_ids = [item[0] for item in reader if item]
def data_reader(star_id):
    """
    Function to compute the rotation period for a star_id
    Returns [star_id, Prot]
    """
    lcf = search_lightcurvefile(star_id).download()
    lc = lcf.PDCSAP_FLUX.remove_nans()
    pg = lc.to_periodogram()
    Prot = pg.frequency_at_max_power**-1
    return [star_id, Prot]
# Now start calling the function on your list of star IDs and storing the result
results = []
for id_number in star_ids:
    individual_result = data_reader(id_number)  # Call function
    results.append(individual_result)  # Add function response to the result collection
# Now write the data out
with open('star_results.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(results)
You said 'beginner' so this is an answer for someone writing programs almost for the first time.
In Python the straight-forward way to change the star id each time is to write a function and call it multiple times with different star ids. Taking the code you have and changing into a function without changing the behavior at all might look like this:
from lightkurve import search_lightcurvefile
def prot_for_star(star_id):
    lcf = search_lightcurvefile(star_id).download()
    pg = lcf.PDCSAP_FLUX.remove_nans().to_periodogram()
    return pg.frequency_at_max_power**-1
# Now use map to call the function for each star id in the list
star_ids = ['201691589', '201692382']
# Store the results in a list so they can be used more than once
prots = list(map(prot_for_star, star_ids))
print(prots)
This could be inefficient code though. I don't know what this lightkurve package does exactly, so there could be additional ways to save time. If you need to do more than one thing with each lcf object, you might need your functions to be structured differently. Or if creating a periodogram is CPU intensive, and you end up generating the same ones multiple times, there could be ways to save time doing that.
But this is the basic idea of using abstraction to avoid repeating the same lines of code over and over again.
Combining the original star id with its period of rotation can be achieved like this. This is a bit of functional programming magic.
# pair the star ids with their periods of rotation
for star_id, prot in zip(star_ids, prots):
    # separate the id and prot with a comma for csv format
    # (format() is used because prot is not a string)
    print('{},{}'.format(star_id, prot))
The output of the Python script can then be stored in a csv file.
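If you would rather write the csv file from Python directly, something along these lines should work. It is a sketch, not tested against lightkurve: write_periods and dummy_period are names I made up, and dummy_period is only a stand-in that returns a meaningless number so the example runs offline. In real use you would pass prot_for_star instead.

```python
import csv
import io


def write_periods(star_ids, period_func, outfile):
    """Write one 'star_ID,Prot' row per star to an open file-like object."""
    writer = csv.writer(outfile)
    writer.writerow(["star_ID", "Prot"])  # header row
    for star_id in star_ids:
        writer.writerow([star_id, period_func(star_id)])


def dummy_period(star_id):
    # Placeholder only: NOT a real rotation period.
    return len(star_id)


# An in-memory buffer stands in for open('star_results.csv', 'w', newline='').
buffer = io.StringIO()
write_periods(["201691589", "201692382"], dummy_period, buffer)
print(buffer.getvalue())
```

With 10000 stars some downloads will probably fail, so wrapping the period_func call in a try/except and recording the failures may be worth considering.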

Get unique values of every column from a gz file

I have a gz file, and i want to extract the unique values from each column from the file, field separator is |, i tried using python as below.
import sys,os,csv,gzip
from sets import Set
ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz","rb") as f:
    reader = csv.reader(f,delimiter="|")
    for i in range(0,400):
        unique = Set()
        print "Unique_value for column "+str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d +=1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique
I don't find it efficient for large files, although it is working somehow, but seeking this problems from bash point of view.
any shell script solution?
for example i have the data in my file as
5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,
and I want all unique values from each column.
With the gunzipped file, you could do:
awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | sort -u" } }' filename | sh
Set the field separator to , and then for each field in the file, construct a cut command piping through sort -u (uniq on its own only collapses adjacent duplicates, so the values have to be sorted first) and finally pipe the whole awk response through sh. The use of cut, sort and sh will slow things down and there is probably a more efficient way but it's worth a go.
A shell built pipeline could indeed do this job faster, though likely less memory efficient. The primary reasons are two: parallelism and native code.
First, since we have little description of the task, I'll have to read the Python code and figure out what it does.
from sets import Set is an odd line; sets are part of the standard library, and I don't know what your sets module contains. I'll have to guess it's at best another name for the standard set type, or at least a less efficient variant of the same concept.
gzip.open lets the script read a gzipped file. We can replace this with a zcat process.
csv.reader reads character separated values, in this case splitting on '|'. Deeper inside the code we find only one column (line[i]) is read, so we can replace it with cut or awk ... until i changes. awk can handle that case too, but it's a little trickier.
The trickiest part is the end logic. Every time 10 unique values are found in a column, the program outputs those values and switches to the next column. By the way, Python's for has an else clause specifically for this case, so you don't need a flag variable.
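To illustrate the for ... else remark, here is a small sketch. The function name first_n_unique and the sample values are made up; the point is only that the else branch runs when the loop finishes without hitting break, which is the job the flag variable does in the original script.

```python
def first_n_unique(values, limit):
    """Collect unique values until `limit` of them have been seen."""
    unique = set()
    for value in values:
        unique.add(value)
        if len(unique) >= limit:
            status = "limit reached"
            break
    else:
        # Only runs when the loop was not left through break.
        status = "input exhausted"
    return unique, status


print(first_n_unique(["a", "b", "a", "c", "d"], 3))
print(first_n_unique(["a", "b", "a"], 3))
```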
One of the odder parts of the code is how you catch all exceptions from the inner data processing block. Why is this? There are basically only two sources of exceptions in there: Firstly, the indexing could fail if there aren't that many columns. Secondly, the unknown Set type could be throwing exceptions; the standard set type would not.
So, the analysis of your function is: in a diagonal manner (since the file is never rewound, and columns are not processed in parallel), collect unique values from each column until ten are found, and print them. This means, for instance, that if the first column had less than ten unique items nothing is ever printed for any other columns. I'm not sure this is the logic you intended.
With such complicated logic, Python's set functionality actually is a good choice; if we could partition the data more easily then uniq might have been better. What throws us off is how the program moves from column to column and only wants a specific number of values.
Thus, the two big time wasters in the Python program are decompressing in the same thread as we do all the parsing, and splitting into all columns when we only need one. The former can be addressed using a thread, and the latter is probably best done using a regular expression such as r'^(?:[^|]*\|){3}([^|]*)'. That expression would skip three columns and the fourth can be read as group 1. It gets more complicated if the CSV has quoting to contain the separator within some column. We could do the line parsing itself in a separate thread, but that wouldn't solve the issue of the many unneeded string allocations.
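As a quick check of that regular expression, here is a sketch that applies it to one of the sample lines. The sample line is rewritten with | as the separator (as stated in the question), and the constant name FOURTH_COLUMN is my own; as noted above, this does not handle quoted fields that contain the separator.

```python
import re

# Skip three '|'-terminated columns, then capture the fourth one.
FOURTH_COLUMN = re.compile(r'^(?:[^|]*\|){3}([^|]*)')

line = "5C4423|COMP|ISIN|CA2372051094|2016-04-19|"
match = FOURTH_COLUMN.match(line)
fourth = match.group(1) if match else None
print(fourth)
```

A line with fewer than four columns simply does not match, so fourth would be None instead of raising an exception.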
Note that the problem actually becomes considerably different if what you really want is to process all columns from the start of the file. I also don't know why you specifically process 400 columns regardless of the amount that exist. If we remove those two constraints, the logic would be more like:
firstline=next(reader)
sets = [{column} for column in firstline]
for line in reader:
    for column,columnset in zip(line,sets):
        columnset.add(column)
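Put together with the gzip reading, that logic might look like the sketch below in Python 3. The function name unique_per_column and the tiny temporary sample file are assumptions for illustration; the file is opened in text mode ("rt") because the Python 3 csv module expects strings rather than bytes.

```python
import csv
import gzip
import os
import tempfile


def unique_per_column(path, delimiter="|"):
    """Return one set of unique values per column of a gzipped CSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        first_row = next(reader)
        column_sets = [{value} for value in first_row]
        for row in reader:
            for value, column_set in zip(row, column_sets):
                column_set.add(value)
    return column_sets


# Build a small gzipped sample so the sketch can run on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.gz")
    with gzip.open(sample_path, "wt", newline="") as handle:
        handle.write("5C4423|COMP|ISIN\n41C528|COMP|ISIN\n")
    result = unique_per_column(sample_path)

print(result)
```

Because zip stops at the shorter sequence, rows with fewer columns than the first row are tolerated, and extra columns in later rows are ignored.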
this is a pure python version based on your idea:
from io import StringIO
from csv import reader
txt = '''5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,'''
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            s.add(item)
which yields for your input:
[{'129DC8', '41C528', '4DE8CD', '5C4423', '9E7F41', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094',
'CA39260W1023',
'NL0000344265',
'QA000A0NCQB1',
'US2333774071',
'US37253A1034'},
{'2000-01-01', '2008-03-06', '2012-09-07', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
oops, now that i have posted my answer i see, that this is exactly what Yann Vernier proposes at the end of his answer. please upvote this answer which was here way earlier than mine...
if you want to limit the number of unique values, you could stop adding to a set once it has reached a maximum size:
from io import StringIO
from csv import reader
MAX_LEN = 3
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            if len(s) < MAX_LEN:
                s.add(item)
print(unique)
with the result:
[{'41C528', '5C4423', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094', 'NL0000344265', 'US2333774071'},
{'2000-01-01', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
this way you would save some memory if one of your columns holds only unique values.

Using Python to write a CSV file with delimiter

I'm new to programming, and also to this site, so my apologies in advance for anything silly or "newbish" I may say or ask.
I'm currently trying to write a script in python that will take a list of items and write them into a csv file, among other things. Each item in the list is really a list of two strings, if that makes sense. In essence, the format is [[Google, http://google.com], [BBC, http://bbc.co.uk]], but with different values of course.
Within the CSV, I want this to show up as the first item of each list in the first column and the second item of each list in the second column.
This is the part of my code that I need help with:
with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', dialect='excel')
    writer.writerows(w for w in foundInstances)
For whatever reason, it seems that the delimiter is being ignored. When I open the file in Excel, each cell has one list. Using the old example, each cell would have "Google, http://google.com". I want Google in the first column and http://google.com in the second. So basically "Google" and "http://google.com", and then below that "BBC" and "http://bbc.co.uk". Is this possible?
Within my code, foundInstances is the list in which all the items are contained. As a whole, the script works fine, but I cannot seem to get this last step. I've done a lot of looking around within stackoverflow and the rest of the Internet, but I haven't found anything that has helped me with this last step.
Any advice is greatly appreciated. If you need more information, I'd be happy to provide you with it.
Thanks!
In your code on pastebin, the problem is here:
foundInstances.append(['http://' + str(num) + 'endofsite' + ', ' + desc])
Here, for each row in your data, you create one string that already has a comma in it. That is not what you need for the csv module. The CSV module makes comma-delimited strings out of your data. You need to give it the data as a simple list of items [col1, col2, col3]. What you are doing is ["col1, col2, col3"], which already has packed the data into a string. Try this:
foundInstances.append(['http://' + str(num) + 'endofsite', desc])
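To make the difference visible, here is a small sketch in Python 3 that writes both shapes of data to an in-memory buffer. The helper name rows_to_csv and the two sample lists are made up for illustration, reusing the Google/BBC values from the question.

```python
import csv
import io


def rows_to_csv(rows):
    """Return the CSV text the csv module produces for the given rows."""
    buffer = io.StringIO()
    csv.writer(buffer, dialect="excel").writerows(rows)
    return buffer.getvalue()


# One string per row: the comma is part of the data, so the field gets quoted.
packed = [["Google, http://google.com"], ["BBC, http://bbc.co.uk"]]
# Two items per row: the csv module inserts the delimiter itself.
separate = [["Google", "http://google.com"], ["BBC", "http://bbc.co.uk"]]

print(rows_to_csv(packed))
print(rows_to_csv(separate))
```

The first output has each row wrapped in quotes as a single field, which is why Excel shows it in one cell; the second has two unquoted fields per row.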
I just tested the code you posted with
foundInstances = [[1,2],[3,4]]
and it worked fine. It definitely produces the output csv in the format
1,2
3,4
So I assume that your foundInstances has the wrong format. If you construct the variable in a complex manner, you could try to add
import pdb; pdb.set_trace()
before the actual variable usage in the csv code. This lets you inspect the variable at runtime with the python debugger. See the Python Debugger Reference for usage details.
As a side note, according to the PEP-8 Style Guide, the name of the variable should be found_instances in Python.
