After sorting a dataset, I have a problem at this point in my code:
with open(fns_land[xx]) as infile:
    lines = infile.readlines()
    for line in lines:
        result_station.append(line.split(',')[0])
        result_date.append(line.split(',')[1])
        result_metar.append(line.split(',')[-1])
The problem is the lines = infile.readlines() line: the data are sometimes so large that the process gets killed. Is there a short/clean way to rewrite this?
Use readline instead; it reads one line at a time without loading the entire file into memory.
with open(fns_land[xx]) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        result_station.append(line.split(',')[0])
        result_date.append(line.split(',')[1])
        result_metar.append(line.split(',')[-1])
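Equivalently, iterating over the file object directly also reads one line at a time; a minimal sketch of the same loop, reusing the names from the question:
import itertools  # not required; shown only if you later want islice-style limits

with open(fns_land[xx]) as infile:
    for line in infile:              # the file object yields one line at a time
        parts = line.split(',')
        result_station.append(parts[0])
        result_date.append(parts[1])
        result_metar.append(parts[-1])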
If you are dealing with a dataset, I would suggest that you have a look at pandas, which is great for data wrangling.
If your problem is a large dataset, you could load the data in chunks.
import pandas as pd
tfr = pd.read_csv('fns_land{0}.csv'.format(xx), iterator=True, chunksize=1000)
Line 1: imports the pandas module.
Line 2: reads data from your csv file in chunks of 1000 lines.
This will be of type pandas.io.parsers.TextFileReader. To load the entire csv file, you follow up with:
df = pd.concat(tfr, ignore_index=True)
The parameter ignore_index=True is added to avoid duplicate index values.
You now have all your data loaded into a dataframe. Do your data manipulation on the columns as vectors, which is also faster than working line by line.
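For example, the per-line splitting from the question can become plain column selections; a minimal sketch, assuming the file has no header row (add header=None to the read_csv call in that case) and the columns arrive in the same order as in the original loop:
# df is the dataframe built above with pd.concat(tfr, ignore_index=True)
result_station = df.iloc[:, 0].tolist()   # first column: station
result_date = df.iloc[:, 1].tolist()      # second column: date
result_metar = df.iloc[:, -1].tolist()    # last column: metar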
Have a look at this question, which dealt with something similar.
Related
I have a very large dataframe with many millions of rows, and it is normally not feasible to load the entire file into memory to work with. Recently some bad data have gotten in, and I need to remove them from the database. So far what I've done is:
file = '/path to database'
rf = pd.read_csv(f'{file}.csv', chunksize = 3000000, index_col=False)
res = pd.concat([chunk[chunk['timestamp'] < 1.6636434764745E+018] for chunk in rf])
res.to_csv(f'{file}.csv', index=False)
Basically it is opening the database and saving the part I want, overwriting the original file.
However the data has gotten so large that this is failing to fit in memory. Is there a better way to truncate a part of the dataframe based on a simple query?
The truncated part would usually be very small compared to the rest, say 100k rows and always at the end.
I would avoid using pandas in this case and just directly edit the csv file itself. For example:
import csv

with open("test_big.csv", "r") as f_in, open("test_out.csv", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        if int(row[-1]) > 9900:  # your condition here
            writer.writerow(row)
For context, test_big.csv looks like this
1,2,3,4,5891
1,2,3,4,7286
1,2,3,4,7917
1,2,3,4,937
...
And is 400,000 records long. Execution took 0.2s.
Edit: Ran with 40,000,000 records and took 15s.
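For the original timestamp filter, the same streaming approach might look like this; a sketch only, assuming the file has a header row containing a "timestamp" column (as in the question), with placeholder file names:
import csv

THRESHOLD = 1.6636434764745e18   # cutoff taken from the question

with open("database.csv", "r", newline="") as f_in, \
        open("database_filtered.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    header = next(reader)                # copy the header row through unchanged
    writer.writerow(header)
    ts_idx = header.index("timestamp")   # locate the timestamp column by name

    for row in reader:
        if float(row[ts_idx]) < THRESHOLD:
            writer.writerow(row)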
I'm trying to whip this out in Python. Long story short, I have a csv file that contains column data I need to inject into another file that is pipe-delimited. My understanding is that Python can't replace values in place, so I have to rewrite the whole file with the new values.
data file (csv):
value1,value2,iwantthisvalue3
source file (txt, | delimited):
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file (txt, | delimited):
samevalue1|samevalue2|replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt (broken code):
import re
import csv

result = []
row = []

with open("C:\data\generatedfixed.csv", "r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])

with open("C:\data\data.txt", "r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n = 0
        for value in result:
            fields[2] = result[n]
            n = n + 1
        row.append(line)
    for value in row
        fixed_file.write(row)
I would highly suggest you use the pandas package here; it makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
After that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both "iwantthisvalue3" and "iwanttoreplacethisvalue3" columns have the same length, then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table we just updated) to a file. Since you want a .txt file with "|" as the delimiter, this is the line that does that (though you can customize how it is saved in many ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
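If the files do not actually have a header row (which the sample lines above suggest), a hedged variant using positional column indices; the index 2 is an assumption based on the sample data:
import pandas as pd

data_file = pd.read_csv(r"C:\data\generatedfixed.csv", header=None)
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|", header=None)

# replace the third column of the pipe-delimited file with the third csv column
source_file[2] = data_file[2]

source_file.to_csv(r"C:\data\data_fixed.txt", sep="|", index=False, header=False)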
Let me know if everything works and whether this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
I'm new to the Pandas library.
I have shared code that works off of a dataframe.
Is there a way to read a gzip file line by line without any delimiter (use the full line; the line can include commas and other characters) as a single row and use it in the dataframe? It seems that you have to provide a delimiter, and when I provide "\n" it is able to read, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields but got 23" since each line is different.
I want it to treat each line as a single row in the dataframe. How can this be achieved? Any tips would be appreciated.
If you just want each line to be one row and one column, then don't use read_csv. Just read the file line by line and build the data frame from it.
You could do this manually by creating an empty data frame with a single column header, then iterating over each line in the file and appending it to the data frame.
# explicitly iterate over each line in the file, appending it to the df
import pandas as pd

with open("query4.txt") as myfile:
    df = pd.DataFrame([], columns=['line'])
    for line in myfile:
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        df = df.append({'line': line}, ignore_index=True)
    print(df)
This will work for large files, as we only process one line at a time while building the dataframe, so we don't use more memory than needed. It probably isn't the most efficient approach, since there is a lot of reassigning of the dataframe, but it would certainly work.
However, we can do this more cleanly, since the pandas DataFrame can take an iterable as the input for its data.
# create a list to feed the data to the dataframe
import pandas as pd

with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)
Here we read all the lines of the file into a list and then pass the list to pandas to create the data frame. The downside is that if our file were very large, we would essentially have two copies of it in memory: one in the list and one in the data frame.
Since we know pandas will accept an iterable for the data, we can use a generator expression to feed each line of the file to the data frame. Now the data frame will build itself by reading each line one at a time from the file.
# create a generator to feed the data to the dataframe
import pandas as pd

with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)
In all three cases there is no need to use read_csv, since the data you want to load isn't a csv. Each solution produces the same data frame output.
SOURCE DATA
this is some data
this is other data
data is fun
data is weird
this is the 5th line
DATA FRAME
line
0 this is some data\n
1 this is other data\n
2 data is fun\n
3 data is weird\n
4 this is the 5th line
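Since the question mentions a gzip file, the same generator pattern can also read straight from the compressed file with gzip.open from the standard library; the filename below is an assumption:
# same generator approach, but reading from a gzipped file in text mode
import gzip
import pandas as pd

with gzip.open("query4.txt.gz", "rt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
    print(df)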
Suppose I have a csv file containing 5 rows.
Now I iterate over this file using a chunksize of 2.
data = pd.read_csv(data_name, header=None, iterator=True, chunksize=2)
Suppose I am doing some magic on this data chunk and appending it to another csv file.
processed_data.to_csv(fname, index=None, mode="a")
Problem: The last row is not written.
I do not know how to solve this problem. Can someone help?
I need to use chunks because I don't have enough RAM.
I can not use chunksize=1, because opening/closing a file is too time consuming.
If you are running out of memory I would use blaze for this type of data.
https://blaze.readthedocs.io/en/latest/ooc.html
Then you don't have to mess with the chunksize.
I have a large number of files that I want to import. I do this one by one with pandas, but some of them only have header text and the actual contents are empty. This is on purpose, but I don't know which files are empty. Also, each file has a different number of columns, and the number of columns in each file is unknown. I use the following code:
lines = pandas.read_csv(fname, comment='#', delimiter=',', header=None)
Is there a way for pandas to return an empty data frame if it doesn't find any non-comment lines in a file? Or some other workaround?
Thanks!
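One possible workaround (a sketch, not from the thread): catch the EmptyDataError that read_csv raises when it finds nothing to parse and fall back to an empty frame. Whether comment-only files trigger this exact error is an assumption worth verifying on your files.
# hypothetical helper: returns an empty DataFrame when read_csv finds no data
import pandas
from pandas.errors import EmptyDataError

def read_or_empty(fname):
    try:
        return pandas.read_csv(fname, comment='#', delimiter=',', header=None)
    except EmptyDataError:
        return pandas.DataFrame()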