Get rid of rows with NaNs in only one column - python

I am new to Python, thank you all for your help in advance!
I am having a lot of trouble accomplishing something in Python that is very easy to do in Excel.
I have a pandas data frame that looks like this:
df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5],
     'c2': [4, 6, 7, None, 3],
     'c3': [0, None, 3, None, 4]})
Notice I have NaN values in columns c2 and c3.
I want to remove all rows with NaN in c2.
So the result should look like this:
c1: [1, 2, 3, 5]
c2: [4, 6, 7, 3]
c3: [0, NaN, 3, 4]
I tried all sorts of list comprehensions but they either contain bugs or won't give me the correct result.
I think this is close:
[x for x in df["c2"] if x != None]

You don't need a list comprehension; for a pure pandas solution:
df.dropna(subset=['c2'])
The subset argument lets you select which columns to inspect for NaNs.
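For reference, here is the dropna call run against the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5],
     'c2': [4, 6, 7, None, 3],
     'c3': [0, None, 3, None, 4]})

# Drop only the rows where c2 is NaN; the NaN in c3 survives.
cleaned = df.dropna(subset=['c2'])
print(cleaned)
```

Row 3 (where c2 is None) is gone, while row 1 keeps its NaN in c3, exactly as in the desired output.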

You're very close, but filtering each column independently leaves the lists with different lengths and misaligns the rows (pd.DataFrame will also refuse unequal-length lists). Filter every column by the positions where c2 is not None instead:
d = {'c1': [1,2,3,4,5],
     'c2': [4,6,7,None,3],
     'c3': [0,None,3,None,4]}
keep = [i for i, x in enumerate(d['c2']) if x is not None]
df = pd.DataFrame({k: [v[i] for i in keep] for k, v in d.items()})

Since all your columns are stored as lists, you can use c2.index(None) to get the index of None in c2. Then remove that index from each list using pop(). More documentation here: https://docs.python.org/2/tutorial/datastructures.html
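A minimal sketch of that index/pop approach on the c2 list from the question:

```python
c2 = [4, 6, 7, None, 3]
ind = c2.index(None)  # index of the first None
c2.pop(ind)           # remove that element in place
print(c2)  # [4, 6, 7, 3]
```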

Given this data:
data = {
    'c1': [4, 6, 7, None, 3],
    'c2': [4, 6, 7, None, 3],
    'c3': [0, None, 3, None, 4]
}
Removal of the first instance:
Values equal to None can be removed as follows:
ind = data['c2'].index(None)
data['c2'].pop(ind)
You may wish to implement a function to automate this:
def remove(data_set, item, value):
    ind = data_set[item].index(value)
    return data_set[item].pop(ind)
Removal of all instances:
Notice that this will remove only the first occurrence of None, or of any other value. To remove all occurrences without an explicit loop, you can use a set difference (note that a set does not preserve the original order or keep duplicates):
tmp = set(data['c2']) - {None}
data['c2'] = list(tmp)
or define a function:
def remove(data_set, item, value):
    response = set(data_set[item]) - {value}
    return list(response)
whereby:
data['c2'] = remove(data, 'c2', None)
Comparison of results:
The first 2 solutions return this for c2:
[4, 6, 7, 3]
and the set-based solutions return the same values for c2, though not necessarily in the original order.
The first 2 solutions, applied to c3, return:
[0, 3, None, 4]
whereas the last 2 solutions, applied to c3, return:
[0, 3, 4]
Hope you find this helpful.
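As a side note, since a set does not keep the original order, an order-preserving alternative for removing all occurrences is a comprehension (remove_all is a made-up helper name, not from the answer above):

```python
def remove_all(values, target):
    # Keep the original order and any duplicates, unlike the set-difference approach.
    return [x for x in values if x != target]

print(remove_all([4, 6, 7, None, 3], None))     # [4, 6, 7, 3]
print(remove_all([0, None, 3, None, 4], None))  # [0, 3, 4]
```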


How to merge values of two arrays into one? [duplicate]

This question already has answers here:
How to merge lists into a list of tuples?
(10 answers)
Closed 1 year ago.
How to merge two arrays as value-pairs in python?
Example as follows:
A = [0,2,2,3]
B = [1,1,4,4]
Output:
[[0,1],[2,1],[2,4],[3,4]]
You can simply use "zip"
l1 = [0,2,2,3]
l2 = [1,1,4,4]
print(list(map(list ,zip(l1,l2))))
In addition to Greg's answer, if you need key-value pairs, cast your zip result to dict:
l1 = [0,2,2,3]
l2 = [1,1,4,4]
print(dict(zip(l1,l2)))
Output
{0: 1, 2: 4, 3: 4}
Before creating any loop, try to use built-ins.
Also there is a similar question for your need
Zip with list output instead of tuple
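For completeness, a small sketch of the list-output variant described in that linked question:

```python
l1 = [0, 2, 2, 3]
l2 = [1, 1, 4, 4]

# zip yields tuples; convert each pair to a list.
merged = [list(pair) for pair in zip(l1, l2)]
print(merged)  # [[0, 1], [2, 1], [2, 4], [3, 4]]
```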
You can iterate through both lists simultaneously using zip() and append each pair to a result list like so:
A = [0,2,2,3]
B = [1,1,4,4]
result = []
for item1, item2 in zip(A,B):
    result.append([item1, item2])
Output = [[0,1],[2,1],[2,4],[3,4]]
print(result) # Prints: [[0,1],[2,1],[2,4],[3,4]]
print(Output == result) # Prints: True
This would give you a list of lists like you were looking for in your question as an output.
Things to keep in mind
If the two starting lists are different sizes, then zip() throws away values after one of the lists runs out, so with:
A = [0,2,2,3,4,5]
B = [1,1,4,4]
result = []
for item1, item2 in zip(A,B):
    result.append([item1, item2])
Output = [[0,1],[2,1],[2,4],[3,4]]
print(result) # Prints: [[0,1],[2,1],[2,4],[3,4]]
print(Output == result) # Prints: True
Notice that the 4 and 5 in list A are thrown out and ignored.
Key-Value Pair
Also, this is not a key-value pair; for that you will want to look into dictionaries in Python. That would be something like:
output = {0:1, 2:4, 3:4}
This would allow you to look up a value based on its key, like so:
output[3] # Would be 4
output[0] # Would be 1
Which doesn't work for this example because there are two 2's used as keys, so one would be overridden.
Since you have mentioned key-value, you probably mean a dictionary.
A = [0, 2, 2, 3]
B = [1, 1, 4, 4]
dct = {} # Empty dictionary
for key, value in zip(A, B):
    dct[key] = value
print(dct)
The output will be:
{0: 1, 2: 4, 3: 4}
Note that, by definition, a dict can't have two identical keys, so in your case {2: 1} will be overridden by {2: 4}.
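If you do need to keep every value for a repeated key, one option (a sketch going beyond what was asked) is to collect the values into lists with defaultdict:

```python
from collections import defaultdict

A = [0, 2, 2, 3]
B = [1, 1, 4, 4]

pairs = defaultdict(list)
for key, value in zip(A, B):
    pairs[key].append(value)  # repeated keys accumulate instead of overriding
print(dict(pairs))  # {0: [1], 2: [1, 4], 3: [4]}
```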

Pandas dataframes custom ordering

In one column, I have 4 possible (non-sequential) values: A, 2, +, ?, and I want to order rows according to the custom sequence 2, ?, A, +. I followed some code I found online:
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'].astype(order_by_custom)
df.sort_values('column_name', ignore_index=True)
But for some reason, although it does sort, it still does so according to alphabetical (or binary value) position rather than the order I entered in the order_by_custom object.
Any ideas?
.astype does return the converted Series, but you did not do anything with it. Try assigning it back to your df. Consider the following example:
import pandas as pd
df = pd.DataFrame({'orderno':[1,2,3],'custom':['X','Y','Z']})
order_by_custom = pd.CategoricalDtype(['Z', 'Y', 'X'], ordered=True)
df['custom'] = df['custom'].astype(order_by_custom)
print(df.sort_values('custom'))
output
   orderno custom
2        3      Z
1        2      Y
0        1      X
You can use a custom dictionary to sort it. For example, a dictionary like:
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+' : 3}
If your column name is "my_column_name", then:
df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict))
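Either answer can be checked against the values from the question; a sketch using the categorical route (the column name and sample values here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'my_column_name': ['A', '+', '2', '?', 'A']})
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)

# Assign the converted Series back -- astype does not modify df in place.
df['my_column_name'] = df['my_column_name'].astype(order_by_custom)
result = df.sort_values('my_column_name', ignore_index=True)
print(result['my_column_name'].tolist())  # ['2', '?', 'A', 'A', '+']
```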

How to compare index values of list in default dict in python

I have a defaultdict d in Python which contains two lists, as below:
{
'data1': [0.8409093126477928, 0.9609093126477928, 0.642217399079215, 0.577003839123445, 0.7024399719949195, 1.0739533732043967],
'data2': [0.9662666242560285, 0.9235637581239243, 0.8947656867577896, 0.9266919525550584, 1.0220039913024457]
}
In the future there can be many lists in the defaultdict, like data1, data2, data3, data4, etc. I need to compare the values at each index across the lists. So for the above defaultdict I need to check whether data1[0] -> 0.8409093126477928 is smaller than data2[0] -> 0.9662666242560285 or not, and the same goes for the other indices, and store the name of the winning list for each index in a separate list like below:
result = ['data1', 'data2', 'data1', 'data1', 'data1']
If the length of one list is greater than the other's, we simply need to check whether the extra values are smaller than 1 or not. For example, data1[5] cannot be compared with data2[5] because data2[5] does not exist, so we simply check whether data1[5] is less than 1. If it is less than 1 we add it to result; otherwise we ignore it and do not save it in result.
To resolve this I thought of extracting the lists from the defaultdict into separate lists and then using a for loop to compare index values, but when I did print(d[0]) to print the 0th list, it printed []. Why is it printing an empty list? How can I compare the index values as above? Please help. Thanks.
Edit: as suggested by @ggorlen, replaced the custom iterator with zip_longest.
I would do it like this:
zip_longest yields one item from each list per iteration; for a shorter list it returns the fillvalue (1 here) once the iteration goes past its length.
The list comprehension loops through the iterator, gets the index of the minimum item with item.index(min(item)), then gets the corresponding key with keys[item.index(min(item))].
If the selected list is shorter than the current iteration index, it either skips the round or gives an "NA" value.
from itertools import zip_longest
keys = list(d.keys())
lengths = list(map(len,d.values()))
result = [keys[item.index(min(item))]
          for i, item in enumerate(zip_longest(*d.values(), fillvalue=1))
          if lengths[item.index(min(item))] > i]
result
if you want to give a default key instead of skipping when the minimum value found is not less than one:
result = [keys[item.index(min(item))] if lengths[item.index(min(item))] > i else "NA"
          for i, item in enumerate(zip_longest(*d.values(), fillvalue=1))]
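Run against the data in the question, the comprehension gives the expected result (the final round is skipped because data1[5] is not less than 1):

```python
from itertools import zip_longest

d = {
    'data1': [0.8409093126477928, 0.9609093126477928, 0.642217399079215,
              0.577003839123445, 0.7024399719949195, 1.0739533732043967],
    'data2': [0.9662666242560285, 0.9235637581239243, 0.8947656867577896,
              0.9266919525550584, 1.0220039913024457],
}
keys = list(d.keys())
lengths = list(map(len, d.values()))
result = [keys[item.index(min(item))]
          for i, item in enumerate(zip_longest(*d.values(), fillvalue=1))
          if lengths[item.index(min(item))] > i]
print(result)  # ['data1', 'data2', 'data1', 'data1', 'data1']
```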
We can use zip_longest from itertools and a variety of loops to achieve the result:
from itertools import zip_longest
result = []
pairs = [[[z, y] for z in x] for y, x in data.items()]
for x in zip_longest(*pairs):
    x = [y for y in x if y]
    if len(x) > 1:
        result.append(min(x, key=lambda x: x[0])[1])
    elif x[0][0] < 1:
        result.append(x[0][1])
print(result) # => ['data1', 'data2', 'data1', 'data1', 'data1']
First we create pairs of every item in each dict value and its key. This makes it easier to get result keys later. We zip_longest and iterate over the lists, filtering out Nones. If we have more than one element to compare, we take the min and append it to the result, else we check the lone element and keep it if its value is less than 1.
A more verifiable example is
data = {
    'foo': [1, 0, 1, 0],
    'bar': [1, 1, 1, 1, 0],
    'baz': [1, 1, 0, 0, 1, 1, 0],
    'quux': [0],
}
which produces
['quux', 'foo', 'baz', 'foo', 'bar', 'baz']
Element-wise, "quux" wins round 0, "foo" wins round 1, "baz" wins round 2, "foo" wins round 3 thanks to key order (tied with "baz"), and "bar" wins round 4. For round 5, "baz" is the last one standing but isn't below 1, so nothing is taken. For round 6, "baz" is still the last one standing, but since 0 < 1, it's taken.
d = {
    'd0': [0.1, 1.1, 0.3],
    'd1': [0.4, 0.5, 1.4, 0.3, 1.6],
    'd2': [],
}
import itertools
# sort by length of lists, shortest first and longest last
d = sorted(d.items(), key=lambda k: len(k[1]))
# loop through all combinations possible
for (key1, list1), (key2, list2) in itertools.combinations(d, 2):
    result = []
    for v1, v2 in itertools.zip_longest(list1, list2):  # shorter list is padded with None
        # no need to check if v2 is None because of sorting
        if v1 is None:
            result.append(key2 if v2 < 1 else None)
        else:
            result.append(key1 if v1 < v2 else key2)
    # Do stuff with result, keys, lists, etc...
    print(f'{key1} vs {key2} = {result}')
Output
d2 vs d0 = ['d0', None, 'd0']
d2 vs d1 = ['d1', 'd1', None, 'd1', None]
d0 vs d1 = ['d0', 'd1', 'd0', 'd1', None]
I sorted them based on the list lengths. This ensures that list1 will always be shorter or of the same length as list2.
For different lengths, the remaining indices will be a mixture of None and key2.
However, when the elements are equal, key2 is added to the result. This might not be the desired behavior.

Convert pandas dataframe to dict to JSON, unflatten nested subkeys, drop None/NaN keys

Can the following be done in Pandas in one go, in more Pythonic code than below?
I have a row from a pandas-dataframe:
some values may be NaNs or empty strings or similar
I'd like to map this information to a dict (which is then converted to JSON and passed on to another application)
However, NaNs should not be included in the dict. (By default they are passed as None)
Dict subkeys 'c.x', 'c.y', 'c.z' should be unflattened, i.e. converted to a subdict c with keys x, y, z. Again, NaN keys in each row should be dropped.
Sample input: I iterate over rows in a dataframe with row = next(df.iterrows()), where a sample row would look like:
a 3
b NaN
c.x 4
c.y 5
c.z NaN
Desired output
{"A": 3,
"C": {"X": 4, "Y": 5}}
The most natural way (to me) to do that would look something like this:
outdict = {"A": row['a'] if not pandas.isna(row['a']) else None,
           "B": row['b'] if not pandas.isna(row['b']) else None,
           "C": {"X": row['c.x'] if not pandas.isna(row['c.x']) else None,
                 "Y": row['c.y'] if not pandas.isna(row['c.y']) else None,
                 "Z": row['c.z'] if not pandas.isna(row['c.z']) else None
                 }}
However, this still assigns None to the slots that I'd like to remain empty (the receiving application has difficulty handling nulls).
One workaround would be using this code and subsequently removing all None values in a second pass, or I could use outdict.update for each value (and not update if the value is NaN). But both solutions seem not very efficient to me.
To transform your DataFrame to a dictionary without NaN, there is a straightforward way:
df.dropna().to_dict()
But you also want to create sub-dictionaries from composed keys, and I found no other way than a loop:
df = pandas.DataFrame({"col": [3, None, 4, 5, None]}, index=["a", "b", "c.x", "c.y", "c.z"])
d = df.dropna().to_dict()
d is:
{'col': {'a': 3.0, 'c.x': 4.0, 'c.y': 5.0}}
Then:
d2 = dict()
for k, v in d['col'].items():
    if '.' in k:
        a, b = k.split('.')
        d2.setdefault(a, {})
        d2[a][b] = v
    else:
        d2[k] = v
and d2 is:
{'a': 3.0, 'c': {'y': 5.0, 'x': 4.0}}
If row is a Series object, the following code will not create any entries for NaNs:
outdict = {row.index[i]: row[i]
           for i in range(len(row))
           if not pandas.isna(row[i])}
However, it won't create the nested structure that you want. There are several ways I can think of to solve this, none of which are extremely elegant. The best way I can think of is to exclude the columns with labels of the form a.b when creating outdict; i.e.
outdict = {row.index[i]: row[i]
           for i in range(len(row))
           if not (pandas.isna(row[i]) or '.' in row.index[i])}
then create the subdicts individually and assign them in outdict.
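Combining both ideas into a single pass over the row (a sketch, assuming row is a Series shaped like the sample in the question, and using lowercase output keys rather than the uppercased ones shown above):

```python
import pandas as pd

row = pd.Series({'a': 3, 'b': float('nan'), 'c.x': 4, 'c.y': 5, 'c.z': float('nan')})

outdict = {}
for key, value in row.dropna().items():  # dropna skips the NaN slots entirely
    if '.' in key:
        parent, child = key.split('.', 1)
        outdict.setdefault(parent, {})[child] = value  # nest 'c.x' under 'c'
    else:
        outdict[key] = value
print(outdict)  # {'a': 3.0, 'c': {'x': 4.0, 'y': 5.0}}
```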

Summing up numbers in a defaultdict(list)

I've been experimenting trying to get this to work and I've exhausted every idea and web search; nothing seems to do the trick. I need to sum the numbers in a defaultdict(list) and I just need the final result, but no matter what I do I can only get to the final result by iterating and printing every running sum along the way. What I've been trying, generally:
d = {key: [1, 2, 3]}
running_total = 0
# Iterate values
for value in d.itervalues():
    # iterate through list inside value
    for x in value:
        running_total += x
        print running_total
The result is :
1,3,6
I understand it's doing this because it's iterating through the for loop. What I don't get is how else I can get to each of these list values without using a loop. Or is there some method I've overlooked?
To be clear, I just want the final number returned, e.g. 6.
EDIT: I neglected a huge factor; the items in the list are timedelta objects, so I have to use .seconds to make them into integers for adding. The solutions below make sense and I've tried similar, but trying to throw the .seconds conversion into the sum statement throws an error.
d = { key : [timedelta_Obj1,timedelta_Obj2,timedelta_Obj3] }
I think this will work for you:
sum(td.seconds for sublist in d.itervalues() for td in sublist)
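A quick runnable check of that one-liner (written with Python 3's values() in place of itervalues(); the key name is made up):

```python
from datetime import timedelta

d = {'key': [timedelta(seconds=1), timedelta(seconds=2), timedelta(seconds=3)]}

# Flatten every list in the dict and sum the .seconds of each timedelta.
total = sum(td.seconds for sublist in d.values() for td in sublist)
print(total)  # 6
```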
Try this approach:
from datetime import timedelta as TD
d = {'foo' : [TD(seconds=1), TD(seconds=2), TD(seconds=3)],
'bar' : [TD(seconds=4), TD(seconds=5), TD(seconds=6), TD(seconds=7)],
'baz' : [TD(seconds=8)]}
print sum(sum(td.seconds for td in values) for values in d.itervalues())
You could just sum each of the lists in the dictionary, then take one final sum of the returned list.
>>> d = {'foo' : [1,2,3], 'bar' : [4,5,6,7], 'foobar' : [10]}
# sum each value in the dictionary
>>> [sum(d[i]) for i in d]
[10, 6, 22]
# sum each of the sums in the list
>>> sum([sum(d[i]) for i in d])
38
If you don't want to iterate or to use comprehensions you can use this:
d = {'1': [1, 2, 3], '2': [3, 4, 5], '3': [5], '4': [6, 7]}
print(sum(map(sum, d.values())))
If you use Python 2 and your dict has a lot of keys, it's better to use imap (from itertools) and itervalues:
from itertools import imap
print sum(imap(sum, d.itervalues()))
Your question was how to get the value "without using a loop". Well, you can't. But there is one thing you can do: use the high performance itertools.
If you use chain you won't have an explicit loop in your code. chain manages that for you.
>>> data = {'a': [1, 2, 3], 'b': [10, 20], 'c': [100]}
>>> import itertools
>>> sum(itertools.chain.from_iterable(data.itervalues()))
136
If you have timedelta objects you can use the same recipe.
>>> data = {'a': [timedelta(minutes=1),
timedelta(minutes=2),
timedelta(minutes=3)],
'b': [timedelta(minutes=10),
timedelta(minutes=20)],
'c': [timedelta(minutes=100)]}
>>> sum(td.seconds for td in itertools.chain.from_iterable(data.itervalues()))
8160
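The same chain recipe works in Python 3 with values() in place of itervalues():

```python
from datetime import timedelta
from itertools import chain

data = {'a': [timedelta(minutes=1), timedelta(minutes=2), timedelta(minutes=3)],
        'b': [timedelta(minutes=10), timedelta(minutes=20)],
        'c': [timedelta(minutes=100)]}

# chain.from_iterable flattens the lists; 136 minutes -> 8160 seconds.
total = sum(td.seconds for td in chain.from_iterable(data.values()))
print(total)  # 8160
```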
