How to use str.replace in Python 3 [duplicate] - python

This question already has an answer here:
TypeError: Type aliases cannot be used with isinstance()
(1 answer)
Closed 4 years ago.
I am trying to replace patterns by strings in a column of my dataframe. In Python 2 this works fine.
df = df['Transaction Description'].str.replace("apple", "pear")
But in Python 3 it gives me:
TypeError Traceback (most recent call last)
<ipython-input-10-1f1e7cb2faf3> in <module>()
----> 1 df = df['Transaction Description'].str.replace("apple", "pear")
~/.local/lib/python3.5/site-packages/pandas/core/strings.py in replace(self, pat, repl, n, case, flags, regex)
2427 def replace(self, pat, repl, n=-1, case=None, flags=0, regex=True):
2428 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 2429 flags=flags, regex=regex)
2430 return self._wrap_result(result)
2431
~/.local/lib/python3.5/site-packages/pandas/core/strings.py in str_replace(arr, pat, repl, n, case, flags, regex)
637 raise TypeError("repl must be a string or callable")
638
--> 639 is_compiled_re = is_re(pat)
640 if regex:
641 if is_compiled_re:
~/.local/lib/python3.5/site-packages/pandas/core/dtypes/inference.py in is_re(obj)
217 """
218
--> 219 return isinstance(obj, re_type)
220
221
/usr/lib/python3.5/typing.py in __instancecheck__(self, obj)
258
259 def __instancecheck__(self, obj):
--> 260 raise TypeError("Type aliases cannot be used with isinstance().")
261
262 def __subclasscheck__(self, cls):
TypeError: Type aliases cannot be used with isinstance().
What's the right way to do this in Python 3?
I am using pandas 0.23.0 in both cases.

That's a bug in Python 3.5.2 and lower
What is your python version?
Source
0.23.1 saved the day
#Anush
Note:
You should have first searched Google for "TypeError: Type aliases cannot be used with isinstance()"; you would have arrived at the answer immediately.
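A quick way to tell whether an interpreter falls in the range named in this answer is to inspect sys.version_info. The sketch below takes the "3.5.2 and lower" boundary from the answer as given. The helper names are hypothetical, and replace_literal is only a plain-Python stopgap for literal (non-regex) replacement, not pandas' own method.

```python
import sys


def affected_by_isinstance_bug(version_info=None):
    """True for Python 3.5.0 to 3.5.2, the range named in the answer above."""
    if version_info is None:
        version_info = sys.version_info
    return (3, 5, 0) <= tuple(version_info[:3]) <= (3, 5, 2)


def replace_literal(values, old, new):
    """Literal text replacement on a list; non-string entries pass through unchanged."""
    return [v.replace(old, new) if isinstance(v, str) else v for v in values]


print(affected_by_isinstance_bug())
print(replace_literal(["apple pie", None, "green apple"], "apple", "pear"))
```

On an affected interpreter, upgrading Python or pandas is the real fix; the list-based helper is only meant for cases where neither upgrade is possible.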

Related

0 is not in range in pandas

I am trying to load my data from Jupyter into MySQL Workbench using a single multi-row INSERT statement. I build the statement with a for loop, and I am receiving some error messages.
First, a little background:
I have a CSV file that contains a data set for preprocessing, and I split it into two parts here:
Before_handwashing=copy_monthly_df.iloc[:76]
After_handwashig=copy_monthly_df.iloc[76:]
I have successfully structured and loaded the first data set, Before_handwashing, into MySQL Workbench using the for loop below.
for x in range(Before_handwashing.shape[0]):
insert_query+='('
for y in range(Before_handwashing.shape[1]):
insert_query+= str(Before_handwashing[Before_handwashing.columns.values[y]][x])+', '
insert_query=insert_query[:-2]+'), '
Now I want to structure and load the second part of the data set, After_handwashig, into MySQL Workbench using a similar code structure here.
for x in range(After_handwashig.shape[0]):
insert_query+='('
for y in range(After_handwashig.shape[1]):
insert_query+=str(After_handwashig[After_handwashig.columns.values[y]][x])+', '
insert_query=insert_query[:-2]+'), '
And I am receiving the following error messages:
error message: ValueError
Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
384 try:
--> 385 return self._range.index(new_key)
386 except ValueError as err:
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
KeyError
Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_4316/2677185951.py in <module>
2 insert_query+='('
3 for y in range(After_handwashig.shape[1]):
----> 4 insert_query+=str(After_handwashig[After_handwashig.columns.values[y]][x])+', '
5 insert_query=insert_query[:-2]+'), '
~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
940
941 elif key_is_scalar:
--> 942 return self._get_value(key)
943
944 if is_hashable(key):
~\anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
1049
1050 # Similar to Index.get_value, but we do not fall back to positional
-> 1051 loc = self.index.get_loc(label)
1052 return self.index._get_values_for_loc(self, loc, label)
1053
~\anaconda3\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
385 return self._range.index(new_key)
386 except ValueError as err:
--> 387 raise KeyError(key) from err
388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 0
Can someone help me out in answering this problem?
OK, it took me a moment to find this. Consider these statements:
Before_handwashing=copy_monthly_df.iloc[:76]
After_handwashig=copy_monthly_df.iloc[76:]
When these are done, Before contains lines with indexes 0 to 75. After contains lines starting with index 76. There is no line with index 0, so your attempt to access it causes a key error.
There are two solutions. One is to use iloc to reference lines by ordinal instead of by index:
insert_query+=str(After_handwashig[After_handwashig.columns.values[y]].iloc[x])+', '
The other is to reset the indexes to start from 0:
After_handwashig.reset_index(drop=True, inplace=True)
There's really no point in splitting the dataframe like that. Just index copy_monthly_df directly and have your first loop do range(76): and the second do range(76, copy_monthly_df.shape[0]):.
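The difference between label lookup and positional lookup can be seen on a small stand-in frame. This is a minimal sketch: the column names and the 100-row size are made up for illustration, and only the .iloc[76:] split mirrors the question.

```python
import pandas as pd

# Hypothetical stand-in for copy_monthly_df.
df = pd.DataFrame({"births": range(100), "deaths": range(100, 200)})
after = df.iloc[76:]

# The slice keeps its original labels, so the first label is 76, not 0.
print(after.index[0])

try:
    after["births"][0]  # label lookup: there is no row labelled 0
except KeyError as err:
    print("label lookup failed:", repr(err))

# Fix 1: look up by position instead of by label.
first_by_position = after["births"].iloc[0]

# Fix 2: renumber the labels from 0, then label lookup works again.
renumbered = after.reset_index(drop=True)
first_by_label = renumbered["births"][0]

print(first_by_position, first_by_label)
```

Both fixes return the same row; .iloc leaves the original labels intact, which helps if they are needed later.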

Error won't disappear, even when deleting/commenting the faulty lines

I am using Linux Ubuntu on a virtual machine on Windows 10.
I have downloaded an IPython Notebook from dms_tools.
Now when I try to run certain parts of the code, I get the following error:
/usr/lib/python3.8/distutils/version.py in _cmp(self, other)
335 if self.version == other.version:
336 return 0
--> 337 if self.version < other.version:
338 return -1
339 if self.version > other.version:
TypeError: '<' not supported between instances of 'str' and 'int'
Since I did not know how to solve this problem, I decided to just edit this version.py file (perhaps not so smart, but I did not know what else to do...)
I commented out the faulty lines so that the function returns 0 every time.
Now the weird part: I still get the same error, pointing at the comments:
/usr/lib/python3.8/distutils/version.py in _cmp(self, other)
335 # if self.version == other.version:
336 # return 0
--> 337 # if self.version < other.version:
338 # return -1
339 # if self.version > other.version:
TypeError: '<' not supported between instances of 'str' and 'int'
Now I tested what would happen if I just added some empty new lines and the error looks like this
(pointing at the same line, where nothing even is):
/usr/lib/python3.8/distutils/version.py in _cmp(self, other)
335
336
--> 337
338
339
TypeError: '<' not supported between instances of 'str' and 'int'
I just can not explain what is happening here and I hope someone has an idea.
The complete Traceback is:
TypeError Traceback (most recent call last)
/tmp/ipykernel_2888/2193680867.py in <module>
1 fastqdir = os.path.join(resultsdir, './FASTQ_files/')
2 print("Downloading FASTQ files from the SRA...")
----> 3 dms_tools2.sra.fastqFromSRA(
4 samples=samples,
5 fastq_dump='/home/andreas/sratoolkit.2.11.0-ubuntu64/bin/fastq-dump',
~/.local/lib/python3.8/site-packages/dms_tools2/sra.py in fastqFromSRA(samples, fastq_dump, fastqdir, aspera, overwrite, passonly, no_downloads, ncpus)
91 .decode('utf-8').split(':')[-1].strip())
92 fastq_dump_minversion = '2.8'
---> 93 if not (distutils.version.LooseVersion(fastq_dump_version) >=
94 distutils.version.LooseVersion(fastq_dump_minversion)):
95 raise RuntimeError("fastq-dump version {0} is installed. You need "
/usr/lib/python3.8/distutils/version.py in __ge__(self, other)
68
69 def __ge__(self, other):
---> 70 c = self._cmp(other)
71 if c is NotImplemented:
72 return c
/usr/lib/python3.8/distutils/version.py in _cmp(self, other)
335 if self.version == other.version:
336 return 0
--> 337 if self.version < other.version:
338 return -1
339 if self.version > other.version:
TypeError: '<' not supported between instances of 'str' and 'int'
The issue is with your fastq-dump version. Looking at the source code that generates the error from sra.py:
fastq_dump_version = (subprocess.check_output([fastq_dump, '--version'])
.decode('utf-8')
.replace('"fastq-dump" version', '').split(':'))
if len(fastq_dump_version) == 1:
fastq_dump_version = fastq_dump_version[0].strip()
elif len(fastq_dump_version) == 2:
fastq_dump_version = fastq_dump_version[1].strip()
else:
fastq_dump_version = (subprocess.check_output([fastq_dump, '--help'])
.decode('utf-8').split(':')[-1].strip())
fastq_dump_minversion = '2.8'
if not (distutils.version.LooseVersion(fastq_dump_version) >=
distutils.version.LooseVersion(fastq_dump_minversion)):
raise RuntimeError("fastq-dump version {0} is installed. You need "
"at least version {1}".format(fastq_dump_version,
fastq_dump_minversion))
There is an assumption about the output of fastq-dump --version, i.e. that there is a : right before the version being output. This is not the case for 2.11 though and the subprocess call results in this:
>>> (subprocess.check_output(['sratoolkit.2.11.0-ubuntu64/bin/fastq-dump', '--version']).decode('utf-8').replace('"fastq-dump" version', '').split(':'))
['\n"sratoolkit.2.11.0-ubuntu64/bin/fastq-dump" version 2.11.0\n\n']
This string is then used for the version comparison further down, and distutils complains about being unable to compare it to the version 2.8 saved in fastq_dump_minversion.
The easiest way to fix this is to use another version of the sra toolkit. Version 2.9 should work, as the version output seems to match the expectation:
>>> (subprocess.check_output(['sratoolkit.2.9.0-ubuntu64/bin/fastq-dump', '--version']).decode('utf-8').replace('"fastq-dump" version', '').split(':'))
['\nsratoolkit.2.9.0-ubuntu64/bin/fastq-dump ', ' 2.9.0\n\n']
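A parser that does not depend on the colon can be sketched with a regular expression that takes the dotted number at the end of the output. This is not the dms_tools2 code, only an illustration; extract_version is a hypothetical helper, new_style is the 2.11 output shown above, and old_style is the 2.9 output reassembled from the split list shown above.

```python
import re

# A dotted version number sitting at the very end of the output.
_VERSION_AT_END = re.compile(r"(\d+(?:\.\d+)+)\s*$")


def extract_version(tool_output):
    """Return the trailing dotted version as a tuple of ints, e.g. (2, 11, 0)."""
    match = _VERSION_AT_END.search(tool_output)
    if match is None:
        raise ValueError("no version number found in %r" % tool_output)
    return tuple(int(part) for part in match.group(1).split("."))


new_style = '\n"sratoolkit.2.11.0-ubuntu64/bin/fastq-dump" version 2.11.0\n\n'
old_style = '\nsratoolkit.2.9.0-ubuntu64/bin/fastq-dump : 2.9.0\n\n'
minimum = (2, 8)

# Tuples of ints compare numerically, so 2.11 correctly ranks above 2.8.
print(extract_version(new_style) >= minimum)
print(extract_version(old_style) >= minimum)
```

Comparing tuples of integers also avoids mixing str and int components, which is what triggers the TypeError in the traceback.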
Additional Info
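Why did changing lib/python3.7/distutils/version.py not do the trick? There is a precompiled file in lib/python3.7/distutils/__pycache__ that is being read instead of the actual lib/python3.7/distutils/version.py. If you edit version.py, you should delete the corresponding file in the __pycache__ dir. A running notebook kernel also keeps the already-imported module in memory, so restart the kernel after editing; the traceback displays the file's current text, which is why it pointed at comments and blank lines. Note, though, that I strongly recommend not messing with these files, as you can easily break your Python installation if you don't know what you are doing.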
P.S.
This should be fixed in dms_tools version 2.6.11
First of all, the error TypeError: '<' not supported between instances of 'str' and 'int' means that one of the operands in the comparison is a string and the other is an integer.
You can check which is which with the type() function.
Next, you can rename the file using the mv command and run it again using:
python <filename.py>
In short: you are trying to compare two different data types.
If you're sure that both values are numbers, you can convert them before comparing:
if int(self.version) < int(other.version):

save large numpy array as .mat file

I'm struggling with this problem:
I've 2 large 2D numpy arrays (about 5 GB) and I want to save them in a .mat file loadable from Matlab
I tried scipy.io and wrote
from scipy.io import savemat
data = {'A': a, 'B': b}
savemat('myfile.mat', data, appendmat=True, format='5',
long_field_names=False, do_compression=False, oned_as='row')
but I get the error: OverflowError: Python int too large to convert to C long
EDIT:
Python 3.8, Matlab 2017b
Here is the traceback:
a.shape (600,1048261) of type <class 'numpy.float64'>
b.shape (1048261) of type <class 'numpy.float64'>
data = {'A': a, 'B': b}
savemat('myfile.mat', data, appendmat=True, format='5',
long_field_names=False, do_compression=False, oned_as='row')
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-19-4d1d08a54148> in <module>
1 data = {'A': a, 'B': b}
----> 2 savemat('myfile.mat', data, appendmat=True, format='5',
3 long_field_names=False, do_compression=False, oned_as='row')
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio.py in savemat(file_name, mdict, appendmat, format, long_field_names, do_compression, oned_as)
277 else:
278 raise ValueError("Format should be '4' or '5'")
--> 279 MW.put_variables(mdict)
280
281
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in put_variables(self, mdict, write_header)
847 self.file_stream.write(out_str)
848 else: # not compressing
--> 849 self._matrix_writer.write_top(var, asbytes(name), is_global)
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in write_top(self, arr, name, is_global)
588 self._var_name = name
589 # write the header and data
--> 590 self.write(arr)
591
592 def write(self, arr):
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in write(self, arr)
627 self.write_char(narr, codec)
628 else:
--> 629 self.write_numeric(narr)
630 self.update_matrix_tag(mat_tag_pos)
631
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in write_numeric(self, arr)
653 self.write_element(arr.imag)
654 else:
--> 655 self.write_element(arr)
656
657 def write_char(self, arr, codec='ascii'):
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in write_element(self, arr, mdtype)
494 self.write_smalldata_element(arr, mdtype, byte_count)
495 else:
--> 496 self.write_regular_element(arr, mdtype, byte_count)
497
498 def write_smalldata_element(self, arr, mdtype, byte_count):
~\miniconda3\envs\work\lib\site-packages\scipy\io\matlab\mio5.py in write_regular_element(self, arr, mdtype, byte_count)
508 tag = np.zeros((), NDT_TAG_FULL)
509 tag['mdtype'] = mdtype
--> 510 tag['byte_count'] = byte_count
511 self.write_bytes(tag)
512 self.write_bytes(arr)
OverflowError: Python int too large to convert to C long
I tried also with hdf5storage
hdf5storage.write(data, 'myfile.mat', matlab_compatible=True)
but it fails too.
EDIT:
gives this warning
\miniconda3\envs\work\lib\site-packages\hdf5storage\__init__.py:1306:
H5pyDeprecationWarning: The default file mode will change to 'r' (read-only)
in h5py 3.0. To suppress this warning, pass the mode you need to
h5py.File(), or set the global default h5.get_config().default_file_mode, or
set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are:
'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
f = h5py.File(filename)
Anyway, it creates a 5GB file but when I load it in Matlab I get a variable named with the file path and apparently without data.
Lastly I tried with h5py:
import h5py
hf = h5py.File('C:/Users/flavio/Desktop/STRA-pattern.mat', 'w')
hf.create_dataset('A', data=a)
hf.create_dataset('B', data=b)
hf.close()
but the output file is not recognized/readable in Matlab.
Is splitting the only solution? Hope there is a better way to fix this issue.
For anyone still looking for an answer, this works with hdf5storage:
hdf5storage.savemat(
save_path,
data_dict,
format=7.3,
matlab_compatible=True,
compress=False
)
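For context on why scipy's format '5' writer fails here: the traceback stops at tag['byte_count'] = byte_count. Assuming that tag field is an unsigned 32-bit integer (my reading of the version 5 layout, not something stated in this thread), the arithmetic below shows that array A cannot fit while B can. The shapes and the float64 item size come from the question; nbytes is a hypothetical helper.

```python
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def nbytes(shape, itemsize=8):
    """Bytes needed for an array of the given shape (itemsize=8 for float64)."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


a_bytes = nbytes((600, 1048261))
b_bytes = nbytes((1048261,))

print(a_bytes, a_bytes > UINT32_MAX)  # A needs about 5.03e9 bytes
print(b_bytes, b_bytes > UINT32_MAX)  # B needs about 8.4e6 bytes
```

Under that assumption, splitting A or switching to the HDF5-based 7.3 format, as in the answer above, are the two ways around the limit.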

using pool.map to apply function to list of strings in parallel?

I have a large list of http user agent strings (taken from a pandas dataframe) that I am trying to parse using the python implementation of ua-parser. I can parse the list fine when only using a single thread, but based on some preliminary speed testing, it'd take me well over 10 hours to run the whole dataset.
I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
from ua_parser import user_agent_parser
http_str = df['user_agents'].tolist()
def uaparse(http_str):
for i, item in enumerate(http_str):
return user_agent_parser.Parse(http_str[i])
pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0,len(http_str)))
Right now I'm seeing the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-701fbf58d263> in <module>()
7
8 pool = mp.Pool(processes=10)
----> 9 results = pool.map(uaparse, range(0,len(http_str)))
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
249 '''
250 assert self._state == RUN
--> 251 return self.map_async(func, iterable, chunksize).get()
252
253 def imap(self, func, iterable, chunksize=1):
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
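pool.map already calls the function once per element of the iterable, so the function receives a single item, not the whole list. In your code each call gets an int from range(...), and enumerate() on an int raises the TypeError. Pass the list of strings directly and let the parser handle one string per call. It seems like all you need is:
import multiprocessing as mp
from ua_parser import user_agent_parser
http_str = df['user_agents'].tolist()
pool = mp.Pool(processes=10)
parsed = pool.map(user_agent_parser.Parse, http_str)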
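The same pattern in a self-contained Python 3 form might look like the sketch below. parse_one is a hypothetical stand-in for user_agent_parser.Parse so that the example runs without ua-parser installed; parse_all, the process count, and the chunk size are likewise illustrative choices.

```python
import multiprocessing as mp


def parse_one(ua_string):
    """Hypothetical stand-in for user_agent_parser.Parse: handles ONE string."""
    product = ua_string.split(" ", 1)[0]
    name, _, version = product.partition("/")
    return {"family": name, "version": version}


def parse_all(ua_strings, processes=4):
    """Parse every string in parallel; pool.map passes parse_one one item per call."""
    with mp.Pool(processes=processes) as pool:
        # A larger chunksize cuts inter-process overhead for long lists.
        return pool.map(parse_one, ua_strings, chunksize=1000)


print(parse_one("Mozilla/5.0 (X11; Linux x86_64)"))
```

When running this as a script, call parse_all(http_str) from under an if __name__ == "__main__": guard, since worker processes may re-import the main module on some platforms.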

Unit test fails just changing from Python 2.6.5 to Python 2.7.3; Decimal-related

All my unit tests succeed running in Python 2.6.5; one fails when I run through Python 2.7.3. The code being tested is complex and involves lots of working in floats and converting to Decimal along the way, by converting to str first as was needed in Python 2.6.
Before I start digging, I was wondering if I could be a bit lazy and see if someone has seen this before and has suggestions on what to search for. Here's the result of the test run:
======================================================================
FAIL: test_hor_tpost_winsize_inside_mm (__main__.Test_ShutterCalculator)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_ShutterCalculator.py", line 506, in test_hor_tpost_winsize_inside_mm
self.assertEqual(o.net_width_closing_tolerance, Decimal("6.4"))
AssertionError: Decimal('6.3') != Decimal('6.4')
----------------------------------------------------------------------
Here's the unit test code for test_hor_tpost_winsize_inside_mm():
490 def test_hor_tpost_winsize_inside_mm(self):
491 """
492 Same as above but test mm
493 """
494 o = self.opening
495 o.unit_of_measure = "millimeters"
496 o.formula_mode = "winsize"
497 o.mount = "inside"
498 o.given_width = Decimal("1117.6")
499 o.given_height = Decimal("2365.4")
500 o.louver_spacing = Decimal("101.6")
501 self.make4SidedFrame("9613", '9613: 2-1/2" Face Deco Z', Decimal("63.5"), Decimal("19.1"))
502 so1 = o.subopenings[(0,0)]
503 so1.fold_left = 1
504 so1.fold_right = 1
505 self.calc()
506 self.assertEqual(o.net_width_closing_tolerance, Decimal("6.4"))
507 self.assertEqual(o.net_height_closing_tolerance, Decimal("6.4"))
508 self.assertEqual(o.horizontal_shim, Decimal(".125")) # in inches
509 self.assertEqual(o.vertical_shim, Decimal(".125")) # in inches
510 self.assertEqual(o.width, Decimal("1069.8")) ## 1070 converted directly from inches
511 self.assertEqual(o.height, Decimal("2317.6")) ## 2317.8 converted directy from inches
512 tpost = o.add_hor_tpost()
513 so2 = o.subopenings[(0,1)]
514 so2.fold_left = 1
515 so2.fold_right = 1
516 self.calc()
517 #self.cs()
518 self.assertEqual(o.net_width_closing_tolerance, Decimal("6.4"))
519 self.assertEqual(o.net_height_closing_tolerance, Decimal("12.7"))
520 self.assertEqual(o.horizontal_shim, Decimal(".125")) # in inches
521 self.assertEqual(o.vertical_shim, Decimal(".125")) # in inches
522 self.assertEqual(o.width, Decimal("1069.8")) ## Rick had 42 but agreed that mine is right
523 self.assertEqual(o.height, Decimal("2311.3"))
524 self.assertEqual(so1.width, Decimal("1069.8"))
525 self.assertEqual(so2.width, Decimal("1069.8"))
526 self.assertEqual(so1.height, Decimal("1139.7")) ## Rick had 44.8125 but agreed mine is right
527 self.assertEqual(so2.height, Decimal("1139.7"))
528 self.assertEqual(tpost.center_pos, Decimal("1182.7"))
529 top_panel_section = so1.panels[0].sections[(0,0)]
530 bottom_panel_section = so2.panels[0].sections[(0,0)]
531 self.assertEqual(top_panel_section.louver_count, 9)
532 self.assertEqual(bottom_panel_section.louver_count, 9)
533 self.assertEqual(top_panel_section.top_rail.width, Decimal("112.6")) ## Rick had 4.40625, but given the changes to net
534 self.assertEqual(bottom_panel_section.bottom_rail.width, Decimal("112.7"))
535 self.assertEqual(top_panel_section.bottom_rail.width, Decimal("112.7"))
536 self.assertEqual(bottom_panel_section.top_rail.width, Decimal("112.6"))
Any hint on what to search for in my code to find the source of the discrepancy?
Python 2.7 introduced changes to the Decimal class and float type to help improve accuracy when converting from strings. This could be the source of the change.
Conversions between floating-point numbers and strings are now correctly rounded on most platforms. These conversions occur in many different places: str() on floats and complex numbers; the float and complex constructors; numeric formatting; serializing and deserializing floats and complex numbers using the marshal, pickle and json modules; parsing of float and imaginary literals in Python code; and Decimal-to-float conversion.
You can see the change details here, under "Other language changes"
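One place worth checking is any step where a float becomes a Decimal and is then rounded to one decimal place. The sketch below, run under Python 3, shows how the route taken can flip a result between 6.3 and 6.4. It is not a diagnosis of the failing test: the value 6.35 and the ROUND_HALF_UP mode are assumptions chosen for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

TENTH = Decimal("0.1")
raw = 6.35  # hypothetical intermediate float; its binary value is slightly below 6.35

# Route 1: go through the string form, which reads "6.35" exactly.
via_str = Decimal(str(raw)).quantize(TENTH, rounding=ROUND_HALF_UP)

# Route 2: convert the float's exact binary value, 6.3499999999999996...
via_float = Decimal(raw).quantize(TENTH, rounding=ROUND_HALF_UP)

print(via_str, via_float)
```

Searching the code under test for str() calls on floats, round(), and string formatting just before a Decimal conversion would be a reasonable starting point.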
