Strip all string and make Numpy array from list

I have a list of dictionaries that hold string and float data, e.g. [{a:'DOB', b:'weight', c:height}, {a:12.2, b:12.5, c:11.5}, {a:'DOB', b:33.5, c:33.2}].
I want to convert this to a NumPy array, stripping all keys and string values so that only the float values end up in the array, which I then want to use to work out some stats, e.g. [[12.2, 12.5, 11.5], ['', 33.5, 33.2]].
Where a whole row is strings it should be omitted, but where a single item in a row is a string it should be kept as a null value.
I'm not sure how to achieve this.

This answer combines all the suggestions in the comments. The procedure loops through the initial list of dictionaries and does the following:
Creates a new list using a list comprehension, saving each dictionary value as a float, or None for non-floats.
Counts the number of float values in the new list and appends the list if there is at least 1 float.
Creates an array from the collected lists using np.array().
I added the missing quotes to the starting code in the original post.
Also, in future posts you should at least make an effort to code something, then ask for help where you get stuck.
import numpy as np

test_d = [{'a':'DOB', 'b':'weight', 'c':'height'},
          {'a':12.2, 'b':12.5, 'c':11.5},
          {'a':'DOB', 'b':33.5, 'c':33.2}]
arr_list = []
for d in test_d:
    # keep floats, replace everything else with None
    d_list = [x if isinstance(x, float) else None for x in d.values()]
    # count the floats so that all-string rows can be skipped
    check = sum(isinstance(x, float) for x in d_list)
    if check > 0:
        arr_list.append(d_list)
print(arr_list)
arr = np.array(arr_list)
print(arr)
For reference, here is the list comprehension converted to a standard loop with if/else logic:
for d in test_d:
    # List comprehension converted to a loop with if/else below:
    d_list = []
    for x in d.values():
        if isinstance(x, float):
            d_list.append(x)
        else:
            d_list.append(None)
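If the goal is to compute statistics, one possible variation (a sketch, not part of the answer above) is to use np.nan instead of None. The array then has a float dtype, and NaN-aware functions such as np.nanmean can skip the missing entries. The helper name rows_to_float_array is made up for illustration.

```python
import numpy as np

def rows_to_float_array(rows):
    """Keep float values, replace other values with NaN, drop rows with no floats."""
    kept = []
    for d in rows:
        values = [x if isinstance(x, float) else np.nan for x in d.values()]
        # Check the original values so that all-string rows are skipped.
        if any(isinstance(x, float) for x in d.values()):
            kept.append(values)
    return np.array(kept, dtype=float)

test_d = [{'a': 'DOB', 'b': 'weight', 'c': 'height'},
          {'a': 12.2, 'b': 12.5, 'c': 11.5},
          {'a': 'DOB', 'b': 33.5, 'c': 33.2}]

arr = rows_to_float_array(test_d)
col_means = np.nanmean(arr, axis=0)  # column means, ignoring NaN entries
print(arr)
print(col_means)
```

With None the array would have dtype object, which most NumPy statistics functions cannot handle directly.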

Related

How to change string elements in a list from strings to numpy array names?

I have a python list, like so:
list = [('array_1','array_2'),('array_1','array_3'),('array_2','array_3')]
The pairs in the list above are actually named numpy arrays, so I want to remove the quotes around each array name so that I'm left with:
list = [(array_1, array_2), (array_1, array_3), (array_2, array_3)]
How do I go about doing this?

Look the names up in globals():
lst = [("array_1", "array_2"), ("array_1", "array_3"), ("array_2", "array_3")]
lst = [(globals()[i], globals()[j]) for i, j in lst]
Now lst will contain the actual NumPy arrays instead of just strings.

Either of these will return the NumPy array (e.g. array_1):
# name is a string such as 'array_1'
globals()[name]
or
eval(name)
Note:
I recommend creating a dictionary with the strings as keys and the corresponding arrays as values instead of using eval or globals(), like this:
dict_ = {'array_1': array_1, 'array_2': array_2, 'array_3': array_3}
Then use this dictionary wherever you need to access an array by name.
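As a rough sketch of the dictionary approach applied to the list of pairs from the question (the array contents and the names arrays_by_name, name_pairs and array_pairs are made up for illustration):

```python
import numpy as np

# Example arrays standing in for the asker's array_1, array_2, array_3.
array_1 = np.array([1, 2, 3])
array_2 = np.array([4, 5, 6])
array_3 = np.array([7, 8, 9])

# Explicit mapping from name to array; no eval() or globals() needed.
arrays_by_name = {'array_1': array_1, 'array_2': array_2, 'array_3': array_3}

name_pairs = [('array_1', 'array_2'), ('array_1', 'array_3'), ('array_2', 'array_3')]
array_pairs = [(arrays_by_name[a], arrays_by_name[b]) for a, b in name_pairs]
```

An unknown name raises a KeyError here, rather than silently picking up some unrelated global variable.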

How do I figure out what this code is doing?

I am trying to get my hands dirty by doing some experiments on Data Science using Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code that I couldn't figure out.
This is the line
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and keeps only the values; if more than one value exists, a | is inserted as a separator between them, for instance:
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?

Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast package. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings to lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
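To see the same steps on a single value without pandas, here is a small sketch. The sample string and the helper name extract_names are made up; the real column presumably holds similar text.

```python
from ast import literal_eval

# Hypothetical cell value: a string that looks like a list of dicts.
raw = "[{'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]"

def extract_names(cell):
    """Mimic the chained steps for one cell: parse the string, then collect the 'name' values."""
    parsed = literal_eval(cell)
    return [item['name'] for item in parsed] if isinstance(parsed, list) else []

print(extract_names(raw))    # ['Comedy', 'Action']
print(extract_names('[]'))   # []  (what a filled-in NaN becomes)
print(extract_names('{}'))   # []  (parsed value is a dict, not a list)
```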

You can look at it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures NaN cells are replaced with a string representing an empty list. That's because the genres column is expected to contain lists.
.apply(literal_eval)
ast.literal_eval evaluates each string as a Python literal, turning it into real lists and dictionaries instead of text. Thanks to that, you can then access keys and values.
.apply(
    lambda x: [i['name'] for i in x]
    if isinstance(x, list)
    else []
)
Now you're just applying a function that filters your lists. These lists contain dictionaries. If the input is a list, the function returns the values stored under the key name in each dictionary. Otherwise, it returns an empty list.

How to access an array that contains float variables in python

I have a defined function:
import random

def makeRandomList(values):
    length = len(values)
    new_list = []
    for i in range(length):
        random_num = random.randint(0, length-1)
        new_list.append(values[random_num]*1.0)  # take random samples
    return new_list
which should just take some samples from an input array values. I have imported such an array as a .csv spreadsheet. Two problems occur:
The array should look like this:
['0', '0']
['1.200408', '29629.0550890999947']
['2.438112', '322162.385751669993']
['3.443816', '511142.915559189975']
['4.500272', '703984.472568470051']
['5.505976', '579295.304300419985']
['6.562432', '703984.472568470051']
['7.568136', '579295.304300419985']
['8.624592', '703984.472568470051']
Which I know through these lines:
import csv
with open('ThruputCSV.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter = ',')
    v = []
    for row in readCSV:
        print(row)
When I use v.append(row[1]) instead of print(row), the resulting v looks like this:
['',
'0',
'29629.0550890999947',
'322162.385751669993',
'511142.915559189975',
'703984.472568470051',
'579295.304300419985',
'703984.472568470051',
'579295.304300419985',
'703984.472568470051']
which is correct except for the first entry. Why is the first entry empty?
Now, when running some code (if you're interested, it was kindly provided by a user here), the makeRandomList function, given v as the values parameter, throws an error: can't multiply sequence by non-int of type float
I cannot figure out what the error is. To me, v seems to be an array that contains float values. And this should be fine, because the error occurs in this line: new_list.append(values[random_num]*1.0), in which random_num, some integer value, just gives the index of the v array which I want to access. Does this mean I am not allowed to use append with an array that contains float variables?

You are reading that error wrong. It's not the float value that is the issue.
The error says that a 'sequence' cannot be multiplied by a float. Floats and ints can be multiplied with each other, but a sequence can only be multiplied by an int (which repeats it), never by a float.
The actual problem is that your array values are strings. Note the ' around each one of them. Strings are considered sequences. Convert them to floats and your code will work.
for i in range(length):
    random_num = random.randint(0, length-1)
    new_list.append(float(values[random_num])*1.0)
Edit:
It was pointed out that I originally said sequences cannot be multiplied by floats or ints. To clarify: multiplying a sequence by an int is allowed and repeats its elements, so '12' * 2 gives '1212'. That is useful knowledge in some cases, but it is not what this question needs; here the strings have to be converted to floats first.
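A minimal sketch of the conversion in context, using a shortened copy of the v list from the question. One assumption goes beyond the answer: the empty first entry '' is dropped before sampling, because float('') raises a ValueError. The snake_case function name is just a restyling of makeRandomList.

```python
import random

def make_random_list(values):
    """Sample len(values) items with replacement, converting each string to float."""
    length = len(values)
    new_list = []
    for _ in range(length):
        random_num = random.randint(0, length - 1)
        new_list.append(float(values[random_num]))
    return new_list

v = ['', '0', '29629.0550890999947', '322162.385751669993']

# float('') raises ValueError, so drop empty strings before sampling.
cleaned = [s for s in v if s != '']
samples = make_random_list(cleaned)
print(samples)

print('12' * 2)        # '1212': a sequence times an int repeats it
try:
    '12' * 1.0         # a sequence times a float is not allowed
except TypeError as exc:
    print(exc)
```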

Concatenate dicts of numpy arrays retaining numpy dtype

I'm concatenating python dicts within a loop (not shown). I declare a new empty dict (dsst_mean_all) on the first instance of the loop:
if station_index == 0:
    dsst_mean_all = {}
    for key in dsst_mean:
        dsst_mean_all[key] = []
source = [dsst_mean_all, dsst_mean]
for key in source[0]:
    dsst_mean_all[key] = np.concatenate([d[key] for d in source])
and then, as you can see in the second part of the code above, I concatenate the dict that has been obtained within the loop (dsst_mean) with the large dict that's going to hold all the data (dsst_mean_all).
Now dsst_mean is a dict whose elements are numpy arrays of different types. Mostly they are float32. My question is, how can I retain the datatype during concatenation? My dsst_mean_all dict ends up being float64 numpy arrays for all elements. I need these to match dsst_mean to save memory and reduce file size. Note that dsst_mean for all iterations of the loop has the same structure and elements of the same dtype.
Thanks.

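You can set the dtype of your arrays in the list comprehension. np.asarray is used here rather than .astype so that the initial empty lists, which have no .astype method, are handled too.
Either hard-coded:
dsst_mean_all[key] = np.concatenate([np.asarray(d[key], dtype='float32') for d in source])
Or dynamic, taking the dtype from the incoming array:
dsst_mean_all[key] = np.concatenate([np.asarray(d[key], dtype=dsst_mean[key].dtype) for d in source])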
Docs: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

OK, one way to solve this is to avoid initialising dsst_mean_all with empty lists. That, I think, is why everything is being cast to float64: NumPy treats an empty list as a float64 array, and concatenating it with float32 data promotes the result. With an if/else statement, on the first iteration simply set dsst_mean_all to dsst_mean, whilst for all subsequent iterations do the concatenation as shown in my original question.
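A sketch of that if/else idea, with made-up data standing in for dsst_mean on two loop iterations. The helper accumulate and the keys 'sst' and 'count' are hypothetical, not from the original post.

```python
import numpy as np

def accumulate(all_data, new_data):
    """Return the running dict of arrays, preserving each array's dtype."""
    if all_data is None:
        # First iteration: copy the arrays so their dtypes carry over unchanged.
        return {key: arr.copy() for key, arr in new_data.items()}
    return {key: np.concatenate([all_data[key], new_data[key]]) for key in all_data}

# Made-up station data standing in for dsst_mean on two loop iterations.
batches = [
    {'sst': np.array([1.0, 2.0], dtype=np.float32), 'count': np.array([3], dtype=np.int16)},
    {'sst': np.array([3.0], dtype=np.float32), 'count': np.array([5], dtype=np.int16)},
]

dsst_mean_all = None
for dsst_mean in batches:
    dsst_mean_all = accumulate(dsst_mean_all, dsst_mean)

# For comparison: starting from an empty list promotes float32 to float64.
promoted = np.concatenate([[], batches[0]['sst']])

print({key: arr.dtype for key, arr in dsst_mean_all.items()})
print(promoted.dtype)
```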

Best method to store in python [duplicate]

I'm trying to add items to an array in python.
I run
array = {}
Then, I try to add something to this array by doing:
array.append(valueToBeInserted)
There doesn't seem to be a .append method for this. How do I add items to an array?

{} represents an empty dictionary, not an array/list. For lists or arrays, you need [].
To initialize an empty list do this:
my_list = []
or
my_list = list()
To add elements to the list, use append
my_list.append(12)
To extend the list to include the elements from another list use extend
my_list.extend([1,2,3,4])
my_list
--> [12,1,2,3,4]
To remove an element from a list use remove
my_list.remove(2)
Dictionaries represent a collection of key/value pairs also known as an associative array or a map.
To initialize an empty dictionary use {} or dict()
Dictionaries have keys and values
my_dict = {'key':'value', 'another_key' : 0}
To extend a dictionary with the contents of another dictionary you may use the update method
my_dict.update({'third_key' : 1})
To remove a value from a dictionary
del my_dict['key']

If you do it this way:
array = {}
you are making a dictionary, not an array.
If you need an array (which is called a list in Python), you declare it like this:
array = []
Then you can add items like this:
array.append('a')

Arrays (called lists in Python) use the [] notation. {} is for dicts (also called hash tables, associative arrays, etc. in other languages), so you won't have 'append' for a dict.
If you actually want an array (list), use:
array = []
array.append(valueToBeInserted)

Just for sake of completion, you can also do this:
array = []
array += [valueToBeInserted]
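Be careful with strings, though. This extends the list with each character separately ('s', 't', 'r', ...) rather than adding the whole string:
array += 'string'
To add the whole string as one item, wrap it in a list: array += ['string']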
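A quick check of what += does with a bare string versus a one-element list (the variable names are illustrative only):

```python
letters = []
letters += 'string'      # a string is iterable, so each character is added
print(letters)           # ['s', 't', 'r', 'i', 'n', 'g']

words = []
words += ['string']      # a one-element list adds the whole string
words.append('another')  # append always adds exactly one item
print(words)             # ['string', 'another']
```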

In some languages, like Java, you define an array using curly braces as follows, but in Python they have a different meaning:
Java:
int[] myIntArray = {1,2,3};
String[] myStringArray = {"a","b","c"};
However, in Python, curly braces are used to define dictionaries, which need key:value pairs, as in {'a':1, 'b':2}
To actually define an array (which is called a list in Python) you can do:
Python:
mylist = [1,2,3]
or other examples like:
mylist = list()
mylist.append(1)
mylist.append(2)
mylist.append(3)
print(mylist)
>>> [1,2,3]

You can also do:
array = numpy.append(array, value)
Note that the numpy.append() method returns a new object, so if you want to modify your initial array, you have to write: array = ...
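A small sketch showing that numpy.append leaves the original array untouched. Because every call copies the data, collecting values in a plain list and converting once is a common alternative. The variable names are illustrative only.

```python
import numpy as np

array = np.array([1, 2, 3])
result = np.append(array, 4)   # returns a new array; `array` is unchanged
print(array)
print(result)

# Alternative: build a Python list first, then convert once at the end.
values = []
for value in (1, 2, 3, 4):
    values.append(value)
from_list = np.array(values)
print(from_list)
```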

Isn't it a good idea to learn how to create an array in the most performant way?
It's really simple to create an array and put values into it:
my_array = ["B","C","D","E","F"]
But, now we have two ways to insert one more value into this array:
Slow mode:
my_array.insert(0,"A") - moves all values to the right when entering an "A" in the zero position:
"A" --> "B","C","D","E","F"
Fast mode:
my_array.append("A")
Adds the value "A" to the last position of the array, without touching the other positions:
"B","C","D","E","F", "A"
If you need to display the sorted data, do so later when necessary. Use the way that is most useful to you, but it is interesting to understand the performance of each method.

I believe you are all wrong. you need to do:
array = array[] in order to define it, and then:
array.append ["hello"] to add to it.
