How to add two columns efficiently in Pandas DataFrame? - python

I have quite a large dataset (over 6 million rows with just a few columns). When I try to add two float64 columns (data['C'] = data.A + data.B), I get a memory error:
Traceback (most recent call last):
File "01_processData.py", line 354, in <module>
prepareData(snp)
File "01_processData.py", line 161, in prepareData
data['C'] = data.A + data.C
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 480, in wrapper
return_indexers=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/index.py", line 976, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1304, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1345, in _join_non_unique
how=how, sort=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 465, in _get_join_indexers
return join_func(left_group_key, right_group_key, max_groups)
File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
I understand that this operation uses the index to align the output properly, but it seems inefficient here: since both columns belong to the same DataFrame, they are already perfectly aligned.
I was able to solve the problem by using
data['C'] = data.A.values + data.B.values
but I wonder if there is a method designed for this, or a more elegant solution?

I cannot reproduce what you are doing (adding two columns of the same DataFrame shouldn't hit the alignment issue, as the indexes are identical).
In master/0.14 (releasing shortly):
In [2]: df = DataFrame(np.random.randn(6000000,2),columns=['A','C'],index=pd.MultiIndex.from_product([['foo','bar'],range(3000000)]))
In [3]: df.values.nbytes
Out[3]: 96000000
In [4]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 625.839844 MB per loop
However, in 0.13.1 (I do remember some optimizations went into 0.14):
In [3]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 1113.671875 MB per loop

Do you have a hierarchical index set? My Python used to crash with that, but calling reset_index() prior to summing used to help. However, this was not reproduced by others, so it is not a "guaranteed improvement".
See my post on this
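To make both suggestions concrete, here is a minimal sketch (not from the original answers) that skips index alignment by operating on the underlying arrays, and, if a hierarchical index is suspected, resets it before the arithmetic:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 2), columns=["A", "B"])

# Bypass index alignment by adding the raw NumPy arrays
# (.values in older pandas, .to_numpy() in modern versions).
df["C"] = df["A"].values + df["B"].values

# If a hierarchical (MultiIndex) index is suspected, resetting it
# before the arithmetic is another thing worth trying.
flat = df.reset_index(drop=True)
flat["C"] = flat["A"] + flat["B"]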

Related

MemoryError from melt or concat with large data

I get the error when I try to run pd.melt().
I checked this post (LINK) and tried to modify the code, but I still get the error.
Here is my original code:
melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')
After modifying:
pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')
The modified version still fails with the error below:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
melted = pd.concat(pivot_list).sort_values('ID')
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
return op.get_result()
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
new_data = concatenate_managers(
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
values = _concatenate_join_units(join_units, concat_axis, copy=copy)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
to_concat = [
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
values = algos.take_nd(values, indexer, axis=ax)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
I think the main issue comes from the melt() and concat() parts.
Any ideas on how to deal with this would be appreciated.
Usually, when you get a "MemoryError: unable to allocate" error, this falls into the "user error" category of requesting a reshape operation which is simply too large to fit into memory.
pd.melt is a memory-intensive operation: not only does it create new arrays for all elements in your data, it also reshapes your data into a less efficient format, duplicating many of the current values. The resulting memory penalty depends on the structure of your data and the number of value columns.
Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all elements in your id_vars columns and repeat them for all columns specified by value_vars.
As an example, if your dataframe has 1M rows and 1000 columns, with all cells as float32, the dataframe would take up approximately 4GB in memory. If you then try to melt and specify 4 id_vars, you'll have 4*1M id cells, each repeated 996 times, i.e. 4*1e6*996 ≈ 4 billion cells for the index. Additionally, you'll have a column with 1e6*996 "variables" and finally the same number of "values". The exact total depends on the lengths and dtypes of your column names and the data types of the cells, but this simple example works out to roughly a 23 GB result even if all values were relatively compact float32s.
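To make that arithmetic concrete, here is a rough back-of-the-envelope sketch (a simplification that prices every cell at 4 bytes, i.e. treats everything as a compact float32, and ignores string storage for the variable names):
n_rows = 1_000_000
n_cols = 1_000
n_id_vars = 4
n_value_vars = n_cols - n_id_vars       # 996 columns get stacked

melted_rows = n_rows * n_value_vars     # ~996 million rows after melt

id_cells = n_id_vars * melted_rows      # ~4 billion id cells
var_cells = melted_rows                 # the "variable" column
val_cells = melted_rows                 # the "value" column

total_cells = id_cells + var_cells + val_cells
# Prints roughly 23.9 GB, consistent with the ~23 GB estimate above.
print(f"{total_cells:,} cells -> ~{total_cells * 4 / 1e9:.1f} GB at 4 bytes per cell")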
Melt is a helpful convenience function for reshaping small dataframes. If you have a dataframe which is anywhere near the size I'm talking about in this example, I'd mostly suggest you don't do this, or if you really do need to reshape this way, then you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size. You may want to write out the data iteratively rather than attempting to concatenate the data at the end. This isn't something that will work out of the box - expect some trial & error. You could also look into using out-of-core computation tools - dask.dataframe has a port of melt which could leverage multiple cores and write in parallel to disk.
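As one possible shape for the chunked approach described above (a sketch only; df and the column names are taken from the question, and melted.csv is a hypothetical output path), each melted chunk is written straight to disk instead of being concatenated in memory:
import pandas as pd

id_cols = ["ID", "Col2", "Col3", "Year"]
chunk_size = 100_000
out_path = "melted.csv"  # hypothetical output file

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    melted_chunk = pd.melt(chunk, id_vars=id_cols,
                           var_name="New_Var", value_name="Value")
    # Append each chunk; write the header only once, for the first chunk.
    melted_chunk.to_csv(out_path, mode="a", index=False, header=(start == 0))

# A global sort by 'ID' afterwards would still need to happen out-of-core
# (dask.dataframe also offers melt) if the full result does not fit in memory.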

Python3 Pandas - handle overflow when casting to number greater than data type int64

I am writing a standard script where I fetch data from a database, do some manipulation, and insert the data back into another table.
I am facing an overflow issue while converting a column's type in a DataFrame.
Here's an example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error :
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code it is being generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but that would also overwrite the rows that are not overflowing and lead to inconsistent results.
Can you advise an elegant solution here?
Building on your idea, you can use np.iinfo:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
col1
0 9223372036854775807
Take care of the limit of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
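If float128 is not available on your platform (NumPy does not provide it on Windows, for example), a hedged alternative is to parse with Python's arbitrary-precision integers and clamp before casting; a minimal sketch:
import numpy as np
import pandas as pd

d = {'col1': ['66666666666666666666666666666', '123']}
df = pd.DataFrame(data=d)

ii64 = np.iinfo(np.int64)

# Python ints have arbitrary precision, so the comparison cannot overflow.
as_int = df['col1'].map(int)
clamped = as_int.where(as_int <= ii64.max, ii64.max)
clamped = clamped.where(clamped >= ii64.min, ii64.min)
df['col1'] = clamped.astype('int64')
print(df)
#                   col1
# 0  9223372036854775807
# 1                  123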

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, when reading from parquet and setting the index, I encountered a "TypeError: to union ordered Categoricals, all categories must be the same", which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
At this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And when I try to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, skipping the index does not seem like something I should do, since I'm actually trying to merge several of these datasets and thought that setting the index would give a speed-up over not setting one. (I get the exact same error when trying to merge two dataframes indexed this way.)
I tried to inspect df._meta, and it turns out my categories are "known", as they should be (see dask-categoricals).
I also read this GitHub post about something that looks similar, but did not find a solution there.
Thanks for your help,
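One workaround sometimes used when partitions disagree on categorical ordering (a hedged suggestion, not taken from this thread) is to drop the categorical dtype on the partition column before setting the index:
import dask.dataframe as dd

df = dd.read_parquet(f)
# Cast the partition column back to plain strings so every partition
# carries the same non-categorical dtype, then set the index as before.
df["ID"] = df["ID"].astype(str)
df = df.set_index("ID")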

How can I name columns in a CSV output by pandas?

I am having some trouble writing a CSV file that contains two columns: the first column contains intervals or bins, while the second column contains a count of things in those bins. I made this CSV file from another CSV file containing raw data points. I am able to write the file, but I am unable to name the columns. I expect the output file to be a CSV with two columns, so I supplied a list of two names to the .to_csv function, and it comes up with this error:
Traceback (most recent call last):
File "C:/Users/willi/Documents/Python/csv_processing_scratch/simple_csv_processor.py", line 65, in <module>
create_binned_csv_counts(dir_stringx, data_bin_edges, "value_counts_x_frameintervalsize_" + str(frame_interval_size))
File "C:/Users/willi/Documents/Python/csv_processing_scratch/simple_csv_processor.py", line 36, in create_binned_csv_counts
pd.cut(data_array, bin_edges).value_counts().to_csv(vcfilestring,index_label=True, header=["Coordinate Bins", "Counts for time interval " + str(i)])
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\core\series.py", line 4685, in to_csv
return self.to_frame().to_csv(**kwargs)
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\core\generic.py", line 3228, in to_csv
formatter.save()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 202, in save
self._save()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 310, in _save
self._save_header()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 239, in _save_header
raise ValueError(
ValueError: Writing 1 cols but got 2 aliases
The code block it's coming from is this one:
def create_binned_csv_counts(maindirectorystring, bin_edges, valuecountstring):
    i = 0
    for filename in os.listdir(maindirectorystring):
        vcfilestring = str(filename[0:18]) + "_value_counts.csv"
        os.chdir(maindirectorystring)
        os.chmod(filename, 0o7777)
        df = pd.read_csv(filename)
        data_array = df["Coordinates for bin " + str(i)].to_numpy()
        os.chdir(cwd)
        os.chdir(valuecountstring)
        pd.cut(data_array, bin_edges).value_counts().to_csv(vcfilestring, index_label=True, header=["Coordinate Bins", "Counts for time interval " + str(i)])
        os.chdir(cwd)
        i += 1
I was thinking it has something to do with the data types returned by cut and value_counts, but searching through the documentation for those pandas methods wasn't very enlightening.
Let me know if I can provide more information; I appreciate any and all help I can get.
Also relevant: the first few lines of the output CSV when I don't name the columns. I am also unsure why that zero is there.
0
"(-10, -9]",0
"(-9, -8]",0
"(-8, -7]",0
"(-7, -6]",0
"(-6, -5]",0
"(-5, -4]",0
"(-4, -3]",0
"(-3, -2]",21
"(-2, -1]",13
"(-1, 0]",33
"(0, 1]",74
"(1, 2]",285
I would like it to look something like this
"Coordinate bins", "Count"
"(-10, -9]",0
"(-9, -8]",0
"(-8, -7]",0
"(-7, -6]",0
"(-6, -5]",0
"(-5, -4]",0
"(-4, -3]",0
"(-3, -2]",21
"(-2, -1]",13
"(-1, 0]",33
"(0, 1]",74
"(1, 2]",285
Okay, YOLO helped me start thinking in the right direction; I changed the line with the to_csv call to this:
pd.cut(data_array, bin_edges).value_counts().to_csv(vcfilestring, index_label="Coordinate Bins", index=True, header=["Counts for time interval " + str(i)])
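For context on why the original call raised "Writing 1 cols but got 2 aliases": value_counts() returns a Series, so to_csv writes a single data column plus the index; header therefore takes exactly one alias, and index_label names the bin column. A minimal standalone sketch with toy data (the file name is hypothetical):
import pandas as pd

data = pd.Series([-2.5, -1.3, 0.4, 0.7, 1.2, 1.8])
bin_edges = list(range(-10, 3))

counts = pd.cut(data, bin_edges).value_counts().sort_index()
# One header alias for the single value column; index_label names the bins.
counts.to_csv("value_counts.csv", index_label="Coordinate Bins", header=["Count"])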

Python pandas duplicate values error

I have a large tab-delimited data file, and I want to read it in Python using pandas' read_csv or read_table function. When I read this large file, I get the following error, even after turning off the index_col value.
>>> read_csv("test_data.txt", sep = "\t", header=0, index_col=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 160, in _read
return parser.get_chunk()
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 613, in get_chunk
raise Exception(err_msg)
Exception: Implicit index (columns 0) have duplicate values [372, 1325, 1497, 1636, 2486, 2679, 3032, 3125, 4261, 4669, 5215, 5416, 5569, 5783, 5821, 6053, 6597, 6835, 7485, 7629, 7684, 7827, 8590, 9361, 10194, 11199, 11707, 11782, 12397, 15134, 15299, 15457, 15637, 16147, 17448, 17659, 18146, 18153, 18398, 18469, 19128, 19433, 19702, 19830, 19940, 20284, 21724, 22764, 23514, 25095, 25195, 25258, 25336, 27011, 28059, 28418, 28637, 30213, 30221, 30574, 30611, 30871, 31471, .......
I thought I might have duplicate values in my data and thus used grep to redirect some of these values into a file.
grep "9996744\|9965107\|740645\|9999752" test_data.txt > delnow.txt
Now, when I read this file, it is read correctly as you can see below.
>>> read_table("delnow.txt", sep = "\t", header=0, index_col=None)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns:
0740645 20 non-null values
M 20 non-null values
BLACK/CAPE VERDEAN 20 non-null values
What is going on here? I have been struggling to find a solution, to no avail.
I also tried the 'uniq' command in Unix to see if duplicate lines exist, but could not find any.
Does it have something to do with the chunk size?
I am using the following version of pandas:
>>> pandas.__version__
'0.7.3'
>>>
I installed the latest version of pandas, and I am able to read the file now.
>>> import pandas
>>> pandas.__version__
'0.8.1'
