How can I name columns in a CSV output by pandas? - python

I am having some trouble writing a csv file that contains two columns. the first column contains intervals or bins while the second column contains a count of things in those bins. I made this csv file from another csv file containing raw data points. I am able to write the file but I am unable to name the columns. I expect that the output file should be a csv with two columns, so I supplied a list of two names to the .to_csv function and it comes up with this error
Traceback (most recent call last):
File "C:/Users/willi/Documents/Python/csv_processing_scratch/simple_csv_processor.py", line 65, in <module>
create_binned_csv_counts(dir_stringx, data_bin_edges, "value_counts_x_frameintervalsize_" + str(frame_interval_size))
File "C:/Users/willi/Documents/Python/csv_processing_scratch/simple_csv_processor.py", line 36, in create_binned_csv_counts
pd.cut(data_array, bin_edges).value_counts().to_csv(vcfilestring,index_label=True, header=["Coordinate Bins", "Counts for time interval " + str(i)])
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\core\series.py", line 4685, in to_csv
return self.to_frame().to_csv(**kwargs)
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\core\generic.py", line 3228, in to_csv
formatter.save()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 202, in save
self._save()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 310, in _save
self._save_header()
File "C:\Users\willi\AppData\Roaming\Python\Python38\site-packages\pandas\io\formats\csvs.py", line 239, in _save_header
raise ValueError(
ValueError: Writing 1 cols but got 2 aliases
The code block its coming from is this one
def create_binned_csv_counts(maindirectorystring, bin_edges, valuecountstring):
i = 0
for filename in os.listdir(maindirectorystring):
vcfilestring = str(filename[0:18]) + "_value_counts.csv"
os.chdir(maindirectorystring)
os.chmod(filename, 0o7777)
df = pd.read_csv(filename)
data_array = df["Coordinates for bin " + str(i)].to_numpy()
os.chdir(cwd)
os.chdir(valuecountstring)
pd.cut(data_array, bin_edges).value_counts().to_csv(vcfilestring,index_label=True, header=["Coordinate Bins", "Counts for time interval " + str(i)])
os.chdir(cwd)
i += 1
I was thinking it has something to do with the data types returned by cut and value_counts but searching through the documentation for those pandas methods wasnt very enlightening.
Let me know if I can provide more information, I appreciate any and all help I can get.
Also relevant, the first few lines of the output csv when I dont name the columns, I also am unsure of why that zero is there.
0
"(-10, -9]",0
"(-9, -8]",0
"(-8, -7]",0
"(-7, -6]",0
"(-6, -5]",0
"(-5, -4]",0
"(-4, -3]",0
"(-3, -2]",21
"(-2, -1]",13
"(-1, 0]",33
"(0, 1]",74
"(1, 2]",285
I would like it to look something like this
"Coordinate bins", "Count"
"(-10, -9]",0
"(-9, -8]",0
"(-8, -7]",0
"(-7, -6]",0
"(-6, -5]",0
"(-5, -4]",0
"(-4, -3]",0
"(-3, -2]",21
"(-2, -1]",13
"(-1, 0]",33
"(0, 1]",74
"(1, 2]",285

Okay, YOLO helped me start thinking in the right direction, I changed the line with the to_csv file to this
pd.cut(data_array,bin_edges).value_counts().to_csv(vcfilestring,index_label="Coordinate Bins",index=True, header=["Counts for time interval " + str(i)])

Related

MemoryError from melt or concat with large data

I got the error when I try to run pd.melt().
I checked on this post and tried to modified the code and still got the error. (LINK)
Here is my original code:
melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')
After modifying:
pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File /path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
melted = pd.concat(pivot_list).sort_values('ID')
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
return op.get_result()
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
new_data = concatenate_managers(
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
values = _concatenate_join_units(join_units, concat_axis, copy=copy)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
to_concat = [
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
values = algos.take_nd(values, indexer, axis=ax)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File /path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
I think the main issue came from melt() and concat() parts.
Any idea to deal with should be thankful.
Usually, when you get a "MemoryError: unable to allocate" error, this falls into the "user error" category of requesting a reshape operation which is simply too large to fit into memory.
pd.melt is a memory-intensive operation which not only requires creating new arrays for all elements in your data, it also reshapes your data into a less efficient format, creating many duplicates for current values. the result and the memory penalty will depend on the structure of your data and the number of value columns.
Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all elements in your id_vars column and repeat them for all columns specified by value_vars.
As an example, if your dataframe has 1M rows and 1000 columns, with all cells as float32, the dataframe would take up approximately 4GB in memory. If you then try to melt and specify 4 id_vars, then you'll have 4*1M id cells which will each get repeated (996) times, giving you 4*1e6*996 giving you 4Bn cells for the index. Additionally, you'll have a column with 1e6*996 "variables" and finally the same number of "values". You'd need to know the length and dtype of all the column names and the data types of the cells, but this simple example would result in a 23 GB array even if all values were relatively compact float32s.
Melt is a helpful convenience function for reshaping small dataframes. If you have a dataframe which is anywhere near the size I'm talking about in this example, I'd mostly suggest you don't do this, or if you really do need to reshape this way, then you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size. You may want to write out the data iteratively rather than attempting to concatenate the data at the end. This isn't something that will work out of the box - expect some trial & error. You could also look into using out-of-core computation tools - dask.dataframe has a port of melt which could leverage multiple cores and write in parallel to disk.

Cannot concatenate object when adding to a DataFrame

I am trying to add a sentence as well as a coin(like a label in this case I guess) to a DataFrame. Although I keep getting this error:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\filter.py", line 132, in <module>
df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 2877, in append
return concat(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 294, in concat
op = _Concatenator(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 384, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid
Here is the code:
data = pd.read_csv('C:\\Users\\gjohn\\Documents\\code\\machineLearning\\trading_bot\\testreviews.csv')
df = data['review'] # Create a dataframe of the reviews.
classes = data['class'] # Create a dataframe of the classes.
for sentence in sentences:
coin = find_coin(common_words, sentence)
if len(sentence) > 0 and coin != None:
df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
I can't find how to fix this and I really need help, it would be great if you could help me out. Thanks!
Also sorry for the messy code :D
What is the sentence you use to construct the dictionary?
Perhaps you should check if the dictionary is constructed correctly?

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, while reading from parquet, and setting the index I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
df = df.set_index("ID")
except Exception as ee:
print(traceback.format_exc())
at this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And, when I'm trying to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
schd_str = 'processes'
aa = df.compute(scheduler=schd_str)
print(f"{schd_str}: OK")
except:
print(f"{schd_str}: KO")
print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, it does not seem to be something I should do since I'm actually trying to merge several of these datasets, and thought that setting the index would result in a speed-up over not setting any. (I'm getting the same exact error when trying to merge two dataframes indexed this way)
I tried to inspect df._meta, and it turns out my categories are "known" as they should be? dask-categoricals
I also read this github post about something that looks similar but somehow did not find a solution.
Thanks for your help,

Tableau error "All Fields must be aggregate or constant" when invoking TabPy SCRIPT_REAL

I am calling a TabPy server via a calculated field in a Tableau worksheet to run a hypothesis test: does the rate of Bookings vary significantly by Group?
I have a table such as:
Group Bookings
0 A 1
1 A 0
3998 B 1
3999 B 0
In Python, on the same server (using the python 2.7 docker image) the test I want is simply:
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(df['Group'], df['Bookings'])
prop_test = fisher_exact(df_cont_tbl)
print 'Fisher exact test: Odds ratio = {:.2f}, p-value = {:.3f}'.format(*prop_test)
Returns: Fisher exact test: Odds ratio = 1.21, p-value = 0.102
I connected Tableau to the TabPy server and can execute a hello-world calculated field. For example, I get 42 back with the calculated field: SCRIPT_REAL("return 42", ATTR([Group]),ATTR([Bookings]) )
However, I try to invoke the stats function above with a calculated field to extract the p-value:
SCRIPT_REAL("
import pandas as pd
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(_arg1, _arg2)
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
", [Group], [Bookings] )
I get the notification: The calculation contains errors with the drop-down All fields must be aggregate or constant when using table calculation functions or fields from multiple data sources
I tried wrapping the inputs with ATTR(), as in:
SCRIPT_REAL("
import pandas as pd
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(_arg1, _arg2)
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
",ATTR([Group]), ATTR([Bookings])
)
Which changes the notification to "The calculation is valid" but returns a Pandas ValueError from the server:
An error occurred while communicating with the External Service.
Error processing script
Error when POST /evaluate: Traceback
Traceback (most recent call last):
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tabpy_server/tabpy.py", line 467, in post
result = yield self.call_subprocess(function_to_evaluate, arguments)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tabpy_server/tabpy.py", line 488, in call_subprocess
ret = yield future
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_base.py", line 400, in result
return self.__get_result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_base.py", line 359, in __get_result
reraise(self._exception, self._traceback)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_compat.py", line 107, in reraise
exec('raise exc_type, exc_value, traceback', {}, locals_)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/thread.py", line 61, in run
result = self.fn(*self.args, **self.kwargs)
File "<string>", line 5, in _user_script
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/tools/pivot.py", line 479, in crosstab
df = DataFrame(data)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 266, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 402, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 5398, in _arrays_to_mgr
index = extract_index(arrays)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 5437, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Error type : ValueError
Error message : If using all scalar values, you must pass an index
Example dataset:
To generate the CSV I am connecting to:
import os
import pandas as pd
import numpy as np
from collections import namedtuple
OUTPUT_LOC = os.path.expanduser('~/TabPy_demo/ab_test_demo_results.csv')
GroupObs = namedtuple('GroupObs', ['name','n','p'])
obs = [GroupObs('A',3000,.10),GroupObs('B',1000,.13)]
# note true odds ratio = (13/87)/(10/90) = 1.345
np.random.seed(2019)
df = pd.concat( [ pd.DataFrame({'Group': grp.name,
'Bookings': pd.Series(np.random.binomial(n=1,
p=grp.p, size=grp.n))
}) for grp in obs
],ignore_index=True )
df.to_csv(OUTPUT_LOC,index=False)
Old question, but perhaps this will help someone else. There are a couple of issues. First is in relation to the way the data is passed to the pd.crosstab. Tableau passes a list of values to the tabpy server so wrap this in an array to fix your error you are getting.
SCRIPT_REAL(
"
import pandas as pd
import numpy as np
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(np.array(_arg1), np.array(_arg2))
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
",
attr([Group]), attr([Bookings])
)
Another problem is the way the table calculation is being performed. You want to send tabpy two lists of information each as long as your table. In the default case tableau wants to calculate at the row level which is not going to work.
I included the row count F1 into the csv that I built the workbook on and made sure to calculate the python value along this function.
Now when you put F1 into the worksheet it will return the P-value as many times as you have rows, A workaround for this is to wrap your calculation in another calculation to only return the value if it is the first row and place this in your worksheet.
Now you can place this into a worksheet.

How to add two columns efficiently in Pandas DataFrame?

I have quite large dataset (over 6 million rows with just a few columns). When I try to add two float64 columns (data['C'] = data.A + data.B) it gives me a memory error:
Traceback (most recent call last):
File "01_processData.py", line 354, in <module>
prepareData(snp)
File "01_processData.py", line 161, in prepareData
data['C'] = data.A + data.C
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 480, in wrapper
return_indexers=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/index.py", line 976, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1304, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1345, in _join_non_unique
how=how, sort=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 465, in _get_join_indexers
return join_func(left_group_key, right_group_key, max_groups)
File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
I understand that this operation uses index to properly calculate output, but it seems inefficient, since by the fact that two columns belong to the same DataFrame they have perfect alignment.
I was able to solve the problem by using
data['C'] = data.A.values + data.B.values
but I wonder if there is a method designed to do this or more elegant solution?
I cannot reproduce what you are doing (as it won't hit the alignment issue as the indexes are the same).
In master/0.14 (releasing shortly)
In [2]: df = DataFrame(np.random.randn(6000000,2),columns=['A','C'],index=pd.MultiIndex.from_product([['foo','bar'],range(3000000)]))
In [3]: df.values.nbytes
Out[3]: 96000000
In [4]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 625.839844 MB per loop
However in 0.13.1. (I do remember some optimizations were put in 0.14)
In [3]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 1113.671875 MB per loop
Do you have a hierarchical index set? My python used to crash with that, but reset_index() prior to summing used to help. However, this was not reproduced by others, so this is not a "guaranteed improvement".
See my post on this

Categories

Resources