I need to process data measured every 20 seconds during the whole 2018 year, the raw file has following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to make one pandas dataframe of it or at least one dataframe per every block (its size is coded as amount of samples) of data saving the time of measurement.
How can I ignore all other data trash? I know that it is written periodically (period = amount of samples), but:
- I don't know how many strings are in file
- I don't want to use explicit method file.getline() in cycle, because it would work just endlessly (especially in python) and I have no enough computing power to use it
Is there any method to skip rows periodically in pandas or another lib? Or how else can I resolve it?
There is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get dataframe similar to datatable on the pic + additional column with date-time without technical rows
Use itertools.islice, where N below means read every N lines
from itertools import islice
N = 3
sep = ','
with open(file_path, 'r') as f:
lines_gen = islice(f, None, None, N)
df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
I repeated your data three times. It sounds like you need every 4th row (not starting at 0) because that is where your data lies. In the documentation for skipsrows it says.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in to the lambda function? that is what I am doing below.
I am creating a list of the values i want to keep. and passing the not in to the skiprows argument. In English, skip all the rows that are not every 4th line.
import pandas as pd
# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0,1000000))[3::4]
# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
Just count how many lines are in file and put the list of them (may it calls useless_rows) which are supposed to be skiped in pandas.read_csv(..., skiprows=useless_rows).
My problem was a chip rows counting.
There are few ways to do it:
On Linux command "wc -l" (here is an instruction how to put it into your code: Running "wc -l <filename>" within Python Code)
Generators. I have a key in my relevant rows: it is in last column. Not really informative, but rescue for me. So I can count string with it, appears it's abour 500000 lines and it took 0.00011 to count
with open(filename) as f:
for row in f:
if '2147483647' in row:
continue
yield row
Related
I have a code that works with an excel file (SAP Download) quite extensively (data transformation and calculation steps).
I need to loop through all the lines (couple thousand rows) a few times. I have written a code prior that adds DataFrame columns separately, so I could do everything in one for loop that was of course quite quick, however, I had to change data source that meant change in raw data structure.
The raw data structure has 1st 3 rows empty, then a Title row comes with column names, then 2 rows empty, and the 1st column is also empty. I decided to wipe these, and assign column names and make them headers (steps below), however, since then, separately adding column names and later calculating everything in one for statement does not fill data to any of these specific columns.
How could i optimize this code?
I have deleted some calculation steps since they are quite long and make code part even less readable
#This function adds new column to the dataframe
def NewColdfConverter(*args):
for i in args:
dfConverter[i] = '' #previously used dfConverter[i] = NaN
#This function creates dataframe from excel file
def DataFrameCreator(path,sheetname):
excelFile = pd.ExcelFile(path)
global readExcel
readExcel = pd.read_excel(excelFile,sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={Unnamed 1:propername1 Unnamed 2:propername2 etc.})
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
#Day column-> floor Entry Date -1, if time is less than 5:00:00
if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
else:
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
Problem is, there are many columns that build on one another, so I cannot get them in one for loop, for instance in below example I need to calculate reqsWoSetUpValue, so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within 1 for loop by assigning the values to the dataframecolumn[i] row, because the value will just be missing, like nothing happened.
(dfsorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Anyone have a clue, why I cannot do this in 1 for loop anymore by 1st adding all columns by the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfsorted)):
dfSorted['Reqs wo setup'] = #calculationsteps
dfSorted['Requirements'] = #calculationsteps
dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
dfSorted['Reqs wo setup'][index] = #calculationsteps
dfSorted['Requirements'][index] = #calculationsteps
dfSorted['Other reqs'][index] = #calculationsteps
I have an empty data frame with about 120 columns, I want to fill it using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the data frame and append it to the end df = df.append(df.iloc[-1]). The copy is critical, most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow, I assume the fault is in the append.
What is the correct approach to speed things up ? preallocate the dataframe ?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved like 10% of the time)
I replaced this : df = df.append(df.iloc[-1]) with this : df.iloc[i] = df.iloc[i-1] (i is the current iteration in the loop).(save like 10% of the time).
Did profiling, even though I removed the append the main issue is copying the previous line, meaning : df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []
for line in my_binary_file:
# extract components required from each line to a list of Python types
line_vars = [line['var1'], line['var2'], line['var3']]
# append to list of results
L.append(line_vars)
# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The Fastest way would be load to dataframe directly via pd.read_csv()
Try separating the logic to clean out unstructured to structured data and then use pd.read_csv to load the dataframe.
You can share the sample unstructured line and logic to take out the structured data, So that might share some insights on the same.
Where you use append you end up copying the dataframe which is inefficient. Try this whole thing again but avoiding this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up
from multiprocessing.dummy import Pool as ThreadPool
def YourCleaningFunction(line):
for each line do the following
blablabla
return(your formated lines with ,) # or use the kind of function jpp just provided
pool = ThreadPool(8) # your number of cores
lines = open('your_big_csv.csv').read().split('\n') # your csv as a list of lines
df = pool.map(YourCleaningFunction, lines)
df = pandas.DataFrame(df)
pool.close()
pool.join()
This is what I have currently, I get the error int is 'int' object is not iterable. If I understand correctly my issue is that BIKE_AVAILABLE is assigned a number at the top of my project with a number so instead of looking at the column it is looking at that number and hitting an error. How should I go about going through the column? I apologize in advance for the newby question
for i in range(len(stations[BIKES_AVAILABLE]) -1):
most_bikes = max(stations[BIKES_AVAILABLE])
sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
if most_bikes == max(stations[BIKES_AVAILABLE]):
second_most = max(stations[BIKES_AVAILABLE])
index_1 = index(most_bikes)
index_2 = index(second_most)
most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another method that might be better for you to use with data manipulation is to try the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers:
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly if you enjoy using indexes, there is a function to get you the index of the max value called idxmax built into pandas.
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv").readlines())
which would give 9 for this example.
Update
To get the row (index), you need to store the index of the max number in the column so that you can access this:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv").readlines())),key=lambda t:t[1])[0]
which gives 2 here as the line in test.csv (above) with the max number in column 2 (which is 9) is 2 (i.e. the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i,int(l.split(',')[1])) for i,l in enumerate(lines)),key=lambda t:t[1])[0]
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
If I want to get the max value from the 3rd column I would do this:
largest_number = max([n.split(',')[2] for n in data])
I have a very large data file consists of N*100 real numbers, where N is very large. I want to read the data by columns. I can read it as whole then manipulate it column by column:
data=np.loadtxt(fname='data.txt')
for i in range(100):
np.sum(data[:,i])
Or I can read it column by column and expecting this will save memory and be fast:
for i in range(100):
col=np.loadtxt(fname='data.txt',usecols=(i,))
np.sum(col)
However, the second approach seems not to be faster. Is it because every time the code read the whole data and extract the desired the column? So it is 100 times slower than the first one. Is there any method to read one column after another but much faster?
If I just want to get the 100 number at last row from the file, reading the whole col and get the last elements is not wise choice, how to achieve this?
If I understand your question right, you want only the last row. This would read only the last row for N rows:
data = np.loadtxt(fname='data.txt', skiprows=N-1)
You are asking two things: how to sum across all rows, and how to read the last row.
data=np.loadtxt(fname='data.txt')
for i in range(100):
np.sum(data[:,i])
data is a (N,100) 2d array. You don't need to iterate to sum along each column
np.sum(data, axis=0)
gives you a (100,) array, one sum per column.
for i in range(100):
col=np.loadtxt(fname='data.txt',usecols=(i,))
np.sum(col) # just throwing this away??
With this you read the file 100 times. In each loadtxt call it has to read each line, select the ith string, interpret it, and collect it in col. It might be faster, IF data was so large that the machine bogged down with memory swapping. Other wise, array operations on data will be lot faster than file reads.
As the other answer shows loadtxt lets you specify a skiprows parameter. It will still read all the lines (i.e. f.readline() calls), but it just doesn't process them and collect the values in a list.
Do some of your own time tests.
I asked a question about two hours ago regarding the reading and writing of data from a website. I've spent the last two hours since then trying to find a way to read the maximum date value from column 'A' of the output, comparing that value to the refreshed website data, and appending any new data to the csv file without overriding the old ones or creating duplicates.
The code that is currently 100% working is this:
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
f.write(data.text)
I've tried various ways of finding the maximum value of column 'A'. I've tried a bunch of different ways of using "Dict" and other methods of sorting/finding max, and even using pandas and numpy libs. None of which seem to work. Could someone point me in the direction of a decent way to find the maximum of a column from the .csv file? Thanks!
if you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column
I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)
data = requests.get(url)
with open(csv_file, "w") as f:
f.write(data.text)
with open(csv_file) as f:
next(f) # discard first row from file -- see notes
max_value = max(row[0] for row in csv.reader(f))
with open(csv_file) as f:
next(f) # discard first row from file -- see notes
max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
return row[0]
max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data in a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it by lines.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
It seems like something like this should work:
import requests
import csv
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
all_values = list(csv.reader(f))
max_value = max([int(row[2]) for row in all_values[1:]])
(write-out-the-value?)
EDITS: I used "row[2]" because that was the sample column I was taking max of in my csv. Also, I had to strip off the column headers, which were all text, which was why I looked at "all_values[1:]" from the second row to the end of the file.