I am getting a memory error while creating a DataFrame. I am reading a zip file from S3 and writing the byte data into a DataFrame, but I keep hitting a MemoryError. Could you please help me avoid this, or suggest what changes I can make in my code?
My code:
list_table = []
for table in d:
    dict_table = OrderedDict()
    s_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("start_time--->>", s_time)
    print("tablename--->>", table)
    s3 = boto3.resource('s3')
    key = 'raw/vs-1/load-1619/data' + '/' + table
    obj = s3.Object('*******', key)
    n = obj.get()['Body'].read()
    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    #print(content)
    content_str = content.decode('utf-8')
    df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
    #print(df1)
    #count = os.popen('aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/{0} - | wc -l'.format(table)).read()
    count = int(len(df1)) - 2
    del(df1)
    e_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("End_time---->>", e_time)
    print(count)
    dict_table['Table_Name'] = str(table)
    dict_table['Count'] = count
    list_table.append(dict_table)
I am getting the memory error on the line below:
df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
The error:
Traceback (most recent call last):
File "ravi_sir.py", line 45, in <module>
df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
File "/app/python3/lib/python3.6/site-packages/pandas/core/frame.py", line 520, in __init__
mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1739, in form_blocks
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1784, in _simple_blockify
values, placement = _stack_arrays(tuples, dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1830, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
Does it help to use the pandas Series string split method?
# a sample string
content_str = 'a,b,c,d\nd,e,f,g\nh,i,j,k'
content_str = str(content_str).split('\n')
df1 = pd.DataFrame(content_str)
df1 = df1[0].str.split(',', expand=True)
Posted here instead of the comments because it isn't pretty to post code there.
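Since df1 is only used for its row count, another option (a sketch, not the original code) is to stream the gzip member line by line instead of materialising the whole DataFrame:

```python
import gzip
import io

def count_rows(raw_bytes):
    """Count newline-terminated lines in gzipped CSV bytes without
    building a DataFrame from the whole payload."""
    count = 0
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as gz:
        for _ in io.TextIOWrapper(gz, encoding="utf-8"):
            count += 1
    return count

# tiny stand-in for obj.get()['Body'].read()
sample = gzip.compress(b"h1,h2\na,b\nc,d\n")
print(count_rows(sample))  # 3
```

This keeps only one line in memory at a time, so peak usage no longer grows with the file size.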
I'm seeking your assistance with this issue; I've tried many variations of the syntax but still get the same error. I have multiple CSV files to convert, all pulling the same data; the script works for one of my CSV files but not the other. Looking forward to your feedback. Thank you very much.
My code:
import os
import pandas as pd

directory = 'C:/path'
ext = ('.csv')
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if f.endswith(ext):
        head_tail = os.path.split(f)
        head_tail1 = 'C:/path'
        k = head_tail[1]
        r = k.split(".")[0]
        p = head_tail1 + "/" + r + " - Revised.csv"
        mydata = pd.read_csv(f)

        # to pull columns and values
        new = mydata[["A", "Room", "C", "D"]]
        new = new.rename(columns={'D': 'Qty. of Parts'})
        new['Qty. of Parts'] = 1
        new.to_csv(p, index=False)

        # to merge columns and values
        merge_columns = ['A', 'Room', 'C']
        merged_col = ''.join(merge_columns).replace('ARoomC', 'F')
        new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
        new.drop(merge_columns, axis=1, inplace=True)
        new = new.groupby(merged_col).count().reset_index()
        new.to_csv(p, index=False)
The error I get:
Traceback (most recent call last):
File "C:Path\MyProject.py", line 34, in <module>
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
File "C:Path\MyProject.py", line 9565, in apply
return op.apply().__finalize__(self, method="apply")
File "C:Path\MyProject.py", line 746, in apply
return self.apply_standard()
File "C:Path\MyProject.py", line 873, in apply_standard
results, res_index = self.apply_series_generator()
File "C:Path\MyProject.py", line 889, in apply_series_generator
results[i] = self.f(v)
File "C:Path\MyProject.py", line 34, in <lambda>
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
TypeError: sequence item 1: expected str instance, int found
It's hard to say what you're trying to achieve without a sample of your data. But anyway, to fix the error, you need to cast each value to a string before joining when calling pandas.Series.apply:
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(map(str, x)), axis=1)
Or, you can also use pandas.Series.astype:
new[merged_col] = new[merge_columns].astype(str).apply(lambda x: '.'.join(x), axis=1)
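For illustration, here is a minimal reproduction of the fix on a made-up frame (the column names and values are assumptions, since no sample data was posted):

```python
import pandas as pd

# hypothetical frame reproducing the mixed str/int columns from the question
new = pd.DataFrame({"A": ["x", "y"], "Room": [101, 102], "C": ["a", "b"]})
merge_columns = ["A", "Room", "C"]

# astype(str) converts every cell up front, so join never sees an int
merged = new[merge_columns].astype(str).apply(lambda x: ".".join(x), axis=1)
print(merged.tolist())  # ['x.101.a', 'y.102.b']
```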
I have a number of tab delimited files on my database that have 7 columns with the headers 'name', 'taxID', 'taxRank', 'genomesize', 'numReads', 'numUniqueReads', 'abundance'. I would like to write a program that will call in a file generically (like using sys.argv) to bring in one file at a time and keep columns 0,1,4 (name, taxID, & numReads). I'm trying (very poorly) to do this in Python.
with open(sys.argv[1], 'r') as f:
    rows = (line.split('\t') for line in f)
    d = {row[0]: row[1:] for row in rows}
    df = pd.DataFrame(d)
    sys.argv[1]_prekrona = df.drop(['taxRank','genomesize','numUniqueReads','abundance'], axis = 1)
After running the script I got the error:
Traceback (most recent call last):
File "open_file.py", line 19, in <module>
df.drop(['taxRank','genomesize','numUniqueReads','abundance'], axis = 1)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3697, in drop
errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3108, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3140, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4388, in drop
'labels %s not contained in axis' % labels[mask])
KeyError: "labels ['taxRank' 'genomesize' 'numUniqueReads' 'abundance'] not contained in axis"
This tells me I have to define my columns, but I am not sure how to do that. Any help is greatly appreciated!
Perhaps you could try:
df = pd.read_csv(sys.argv[1], sep="\t", usecols=["name", "taxID", "numReads"])
usecols takes a list of column headers and creates a subset of the data using only those columns; sep="\t" matters here because your files are tab-delimited.
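A self-contained sketch of the same idea, using an in-memory file in place of sys.argv[1] and assuming the seven headers from the question (the data row is made up):

```python
import io
import pandas as pd

# simulated tab-delimited file with the seven headers from the question
raw = (
    "name\ttaxID\ttaxRank\tgenomesize\tnumReads\tnumUniqueReads\tabundance\n"
    "E. coli\t562\tspecies\t5000000\t120\t110\t0.4\n"
)

# sep="\t" because the file is tab-delimited; usecols keeps only the
# three wanted columns while parsing
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["name", "taxID", "numReads"])
print(list(df.columns))  # ['name', 'taxID', 'numReads']
```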
I am trying to calculate the number of tweets containing a single word over a single year, recording each day and its number of tweets, and then store the result in a CSV file with "Date" and "Frequency" columns. This is my code, but I keep getting an error after it runs for some time.
import pandas as pd
import twint
import nest_asyncio
from datetime import datetime, timedelta

bugun = '2020-01-01'
yarin = '2020-01-02'
df = pd.DataFrame(columns=("Data", "Frequency"))
for i in range(365):
    file = open("Test.csv", "w")
    file.close()
    bugun = (datetime.strptime(bugun, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
    yarin = (datetime.strptime(yarin, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
    nest_asyncio.apply()
    c = twint.Config()
    c.Search = "Chainlink"
    #c.Hide_output = True
    c.Since = bugun
    c.Until = yarin
    c.Store_csv = True
    c.Output = "Test.csv"
    c.Count = True
    twint.run.Search(c)
    data = pd.read_csv("Test.csv")
    frequency = str(len(data))
    #d = {"Data": [bugun], "Frequency": [frequency]}
    #d_f = pd.DataFrame(data=d)
    #df = df.append(d_f, ignore_index=True)
    df.loc[i] = [bugun] + [frequency]
df.to_csv(r'C:\Users\serap\Desktop\CRYPTO 100\Chainlink.csv', index=False, header=False)
The error I get is this:
File "C:\Users\serap\Desktop\CRYPTO 100\CODES\Binance_Coin\Binance Coin.py", line 47, in <module>
data = pd.read_csv("Test.csv")
File "C:\Users\serap\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 605, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\serap\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 457, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\serap\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 814, in __init__
self._engine = self._make_engine(self.engine)
File "C:\Users\serap\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 1045, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "C:\Users\serap\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 1893, in __init__
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas\_libs\parsers.pyx", line 521, in pandas._libs.parsers.TextReader.__cinit__
EmptyDataError: No columns to parse from file
Thank you for the help :)
After reading the tutorial "How to Scrape Tweets from Twitter with Python Twint" by Andika Pratama (Analytics Vidhya, on Medium), I think you'd better let Twint do the iteration:
c = twint.Config()
c.Search = "Chainlink"
c.Since = "2020-01-01"
c.Until = "2021-01-01"
c.Store_csv = True
c.Output = "Test.csv"
c.Count = True
twint.run.Search(c)
Now you may loop over the CSV output:
data = pd.read_csv("Test.csv")
# ...
Until now, I haven't found this detail about the CSV output documented, but the twint source code (master/twint/storage/write.py, line 58 ff.) shows that for CSV the output is appended if the file already exists. So you may have to truncate or delete an existing file first. A valid option for this could be
open('Test.csv', 'w').close()
... which is basically the same thing you do, but without introducing another variable.
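With everything in one file, the per-day frequencies can then be computed in a single pass. A sketch, assuming the twint CSV contains a date column (its usual output does); the inline data below stands in for Test.csv:

```python
import io
import pandas as pd

# stand-in for pd.read_csv("Test.csv"); assumes a 'date' column exists
data = pd.read_csv(io.StringIO(
    "id,date,tweet\n"
    "1,2020-01-01,a\n"
    "2,2020-01-01,b\n"
    "3,2020-01-02,c\n"
))

# one row per day with its tweet count
freq = data.groupby("date").size().reset_index(name="Frequency")
freq.columns = ["Data", "Frequency"]
# freq.to_csv("Chainlink.csv", index=False, header=False)
print(freq["Frequency"].tolist())  # [2, 1]
```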
I am having trouble running a script for getting counts of predictions from csv files at a given directory. The format of the csv looks like this:
Sample data
and the code is the following:
import os
from glob import glob
import pandas as pd

def get_count(distribution, keyname):
    try:
        count = distribution[keyname]
    except KeyError:
        count = 0
    return count

main_path = "K:\\...\\folder_name"
folder_paths = glob("%s\\*" % main_path)
data = []
for path in folder_paths:
    file_name = os.path.splitext(os.path.basename(path))[0]
    results = pd.read_csv(path, error_bad_lines=False)
    results['Label'] = pd.Series(results['Filename'].str.split("\\").str[0])
    distribution = results.Predictions.value_counts()
    print(distribution)
    num_of_x = get_count(distribution, "x")
    num_of_y = get_count(distribution, "y")
    num_of_z = get_count(distribution, "z")
    d = {"filename": file_name, "x": num_of_x, "y": num_of_y, "z": num_of_z}
    data.append(d)

df = pd.DataFrame(data=data)
df.to_csv(os.path.join(main_path, "summary_counts.csv"), index=False)
The output error is KeyError: 'Filename', referring to the pd.Series line. Would anyone know how to solve this?
I am using Python 3.7.3 and pandas 1.0.5, and I am a beginner in programming.
Many thanks in advance.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".\save_counts.py", line 24, in <module>
results['Label'] = pd.Series(results['Filename'].str.split("\\").str[0])
File "K:\...\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "K:\...\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Filename'
In here:
for path in folder_paths:
    file_name = os.path.splitext(os.path.basename(path))[0]
    results = pd.read_csv(path, error_bad_lines=False)
    results['Label'] = pd.Series(results['Filename'].str.split("\\").str[0])
you are creating a pd.Series, but those values exist only inside this for loop.
If, after this loop, you want to use the results DataFrame in distribution, you need to collect the per-file results: create an empty DataFrame and append each results to it:
final_results = pd.DataFrame()
for path in folder_paths:
    file_name = os.path.splitext(os.path.basename(path))[0]
    results = pd.read_csv(path, error_bad_lines=False)
    results['Label'] = pd.Series(results['Filename'].str.split("\\").str[0])
    final_results = final_results.append(results)
    # and from this point you can continue
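As a side note, DataFrame.append in a loop copies the frame on every iteration (and was removed in pandas 2.0); collecting the pieces in a list and calling pd.concat once is the usual pattern. A sketch with made-up file names standing in for the real folder_paths:

```python
import pandas as pd

frames = []
for i in range(3):  # stand-in for the folder_paths loop
    results = pd.DataFrame({"Filename": ["dir%d\\file.csv" % i], "Predictions": ["x"]})
    results["Label"] = results["Filename"].str.split("\\").str[0]
    frames.append(results)

# one concatenation at the end instead of repeated appends
final_results = pd.concat(frames, ignore_index=True)
print(final_results["Label"].tolist())  # ['dir0', 'dir1', 'dir2']
```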
I have written a function in Python to run some experiments on the DataFrame 'Demand', using information from each row of the DataFrame 'File2':
def Compute(row):
    File = pd.concat([File2[(File2.Number == row['Number'])]] * len(Demand), ignore_index=True)
    File.Number = Demand.Number
    result = pd.merge(Demand, File, on='Number')
    result['Situation'] = 0
    result.Situation = result.apply(lambda r: 1 if (r.Arrival <= r.Time2) & (r.Departure > r.Time2) & (r.Scenario == r.Scenario2) & (sum(pd.Series(r.Station).isin([row['Station2']])) != 0) else 0, axis=1)
    if len(result) != 0:
        result = result[(result.Situation == 1)]
    return result

File2.apply(Compute, axis=1)
While I don't get any error using a simple for loop over 'File2', I get the following error when I use this function with apply:
Traceback (most recent call last):
File "<ipython-input-133-cbdaa2bf5936>", line 1, in <module>
File2.apply(Compute, axis = 1)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3972, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4081, in _apply_standard
result = self._constructor(data=results, index=index)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 226, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 363, in _init_dict
dtype=dtype)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 5163, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\frame.py", line 5477, in _homogenize
raise_cast_failure=False)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\series.py", line 2887, in _sanitize_array
subarr = _asarray_tuplesafe(data, dtype=dtype)
File "C:\Users\Research\Anaconda3\lib\site-packages\pandas\core\common.py", line 2011, in _asarray_tuplesafe
result[:] = [tuple(x) for x in values]
ValueError: cannot copy sequence with size 11 to array axis with dimension 0
"cannot copy sequence with size 11 to array axis with dimension 0"
Could you please tell me what is the potential problem?