Query dataframe by column name as a variable - python

I know this question has already been asked here, but my question is a bit different. Let's say I have the following df:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'),
                   'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'),
                   'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']
I use this line to count the total number of occurrences of the elements of myList in one of my df columns:
print(df.query('A in @myList').A.count())
5
Now, I am trying to do the same thing by looping through the column names. Something like this:
for col in df.columns:
    print(df.query('col in @myList').col.count())
Also, I was wondering: is using query the most efficient way to do this?
Thanks for the help.

Use this:
df.isin(myList).sum()
A 5
B 5
C 6
dtype: int64
It checks every cell of the dataframe against myList and returns True or False; sum then treats True as 1 and False as 0 and totals each column.
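To answer the loop part of the question directly: a bare `col` inside `query` is parsed as a column named "col", not as the loop variable, so the column name has to be interpolated into the query string, and the result column selected with bracket notation rather than `.col`. A sketch, assuming the column names are valid identifiers:

```python
import pandas as pd

df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'),
                   'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'),
                   'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']

# Interpolate the column name into the query string;
# the @ prefix references the local variable myList.
counts = {col: df.query(f'{col} in @myList')[col].count() for col in df.columns}
print(counts)
```

That said, `df.isin(myList).sum()` above is both shorter and avoids parsing a query string per column.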

Related

Merging dataframes where the common column has repeating values

I would like to merge several sensor files which have a common column, "date", whose value is the time the sensor data was logged. These sensors log data every second. My task is to join these sensor files into one big dataframe. Since there could be a millisecond difference between the exact times the sensor data is logged, we have created a window of 30 seconds using the pandas pd.DatetimeIndex.floor method. Now I want to merge these files using the "date" column. The following is an example I was working on:
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
It is not necessary that the different sensor files have the same amount of data. (The original question showed the sensor data in a screenshot, not reproduced here; the vertical axis corresponds to time, increasing downward.) The second window (B) and the second-to-last window (C) should overlap, as they belong to the same time window.
The resulting dataframe should look like the one pictured in the original question.
The A, B, C, and D values represent 30-second windows (for example, 'A' could be 07:00:00, 'B' could be 07:00:30, 'C' could be 07:01:00, and 'D' could be 07:01:30). As you can see, the starting and ending windows can hold fewer than 30 values. Since the sensor logs data every second, each full window should have 30 values; in the example the B and C windows have only 6 rows each for ease of understanding, not 30. If a sensor started reporting at 07:00:27, its readings fall in window 'A' but it could report only 3 values; similarly, if a sensor stopped reporting at 07:01:04, its last readings fall in window 'C' but it could report only 4 values. The interior windows, however, will always have the full 30 values.
I would like to merge the dataframes so that values from the same window line up (as B and C do in the figure), while the start and end windows show NaN where a sensor has no data. (In the example above, value1 from sensor 1 started reporting one second earlier, while value2 from sensor 2 kept reporting two seconds after sensor 1 stopped.)
How can I achieve such joins in pandas?
You can build your DataFrame with the following solution, which requires only built-in Python structures. I don't see much benefit in using pandas methods here; I'm not even sure this result can be achieved with pandas methods alone, because each value column is padded differently, but I'm curious whether you find a way.
from collections import defaultdict
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
# Part 1
datas = [data1, data2]

## Compute where to fill dicts with NaNs
dates = sorted(set(data1["date"] + data2["date"]))
dds = [{} for i in range(2)]
for d in dates:
    for i in range(2):
        dds[i][d] = [v for k, v in zip(datas[i]["date"], datas[i]["value%i" % (i + 1)]) if k == d]

## Fill dicts
nan = float("nan")
for d in dates:
    n1, n2 = map(len, [dd[d] for dd in dds])
    if n1 < n2:
        # sensor 1 has fewer readings in this window: pad value1 at the end
        dds[0][d] += (n2 - n1) * [nan]
    elif n1 > n2:
        # sensor 2 has fewer readings in this window: pad value2 at the start
        dds[1][d] = (n1 - n2) * [nan] + dds[1][d]

# Part 2: Build the filled data columns
data = defaultdict(list)
for d in dates:
    n = len(dds[0][d])
    data["date"] += [d] * n  # [d] * n (not d * n) so multi-character dates also work
    for i in range(2):
        data["value%i" % (i + 1)] += dds[i][d]
data = pd.DataFrame(data)
If I understand the question correctly, you might be looking for something like this:
import pandas

data1 = pandas.DataFrame({
    'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
    'value1': list(range(1, 20))
})
data2 = pandas.DataFrame({
    'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
    'value2': list(range(1, 21))
})
b = pandas.concat([data1, data2]).sort_values(by='date', ascending=True)
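For the record, a pandas-only approximation is possible: number the rows inside each window with groupby(...).cumcount() and then do an outer merge on (date, row number). Note that this aligns every window from the top, so the leading-NaN padding the first answer applies to the start window comes out as trailing NaNs instead; treat it as a sketch, not a drop-in replacement:

```python
import pandas as pd

data1 = pd.DataFrame({
    'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B',
             'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
    'value1': list(range(1, 20))
})
data2 = pd.DataFrame({
    'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B',
             'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
    'value2': list(range(1, 21))
})

# Number the rows inside each 30-second window, then merge on (date, row number).
for frame in (data1, data2):
    frame['row'] = frame.groupby('date').cumcount()
merged = (data1.merge(data2, on=['date', 'row'], how='outer')
               .sort_values(['date', 'row'])
               .drop(columns='row')
               .reset_index(drop=True))
```

With the example data this yields 21 rows: window D gets NaN in value1 where sensor 1 has stopped, and window A gets NaN in value2 where sensor 2 has fewer readings.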

Return a list with dataframe column values ordered based on another list

I have a df with columns a–h, and I wish to create a list of these column names, but in the order given by another list (list1). The values in list1 are positional indices into df's columns.
df
a b c d e f g h
list1
[3,1,0,5,2,7,4,6]
Desired list
['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g']
You can just do df.columns[list1]:
import pandas as pd
df = pd.DataFrame([], columns=list('abcdefgh'))
list1 = [3,1,0,5,2,7,4,6]
print(df.columns[list1])
# Index(['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g'], dtype='object')
First get a NumPy array of the alphabet:
import numpy as np
arr = np.array(list('abcdefgh'))
Or, in your case, an array of your df columns:
arr = np.array(df.columns)
Then use your indices as an indexing array:
arr[[3, 1, 0]]
out:
array(['d', 'b', 'a'], dtype='<U1')
Another option:
df.columns.to_series()[list1].tolist()

Is there a pandas method to do the opposite of "pandas.factorize" on dataframe columns?

Is there any pandas method to unfactor a dataframe column? I could not find any in the documentation, but was expecting something similar to unfactor in R language.
I managed to come up with the following code, for reconstructing the column (assuming none of the column values are missing), by using the labels array values as indices of uniques.
import numpy as np
import pandas as pd

orig_col = ['b', 'b', 'a', 'c', 'b']
labels, uniques = pd.factorize(orig_col)
recon_col = np.array([uniques[label] for label in labels]).tolist()
orig_col == recon_col  # True
orig_col = ['b', 'b', 'a', 'c', 'b']
labels, uniques = pd.factorize(orig_col)
# To get original list back
uniques[labels]
# array(['b', 'b', 'a', 'c', 'b'], dtype=object)
Yes, we can do it via np.vectorize after building a dict that maps each code back to its value:
np.vectorize(dict(zip(range(len(uniques)), uniques)).get)(labels)
array(['b', 'b', 'a', 'c', 'b'], dtype='<U1')
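For completeness, pandas can also rebuild the original values directly from the codes with Categorical.from_codes, which is the constructor designed for exactly this codes-plus-categories representation (a sketch using the same factorize output as above):

```python
import pandas as pd

orig_col = ['b', 'b', 'a', 'c', 'b']
labels, uniques = pd.factorize(orig_col)

# from_codes interprets `labels` as integer codes into `uniques`
recon = pd.Categorical.from_codes(labels, categories=uniques)
print(list(recon))  # ['b', 'b', 'a', 'c', 'b']
```

This also preserves the categorical dtype, which the uniques[labels] indexing trick does not.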

is there a simpler way to group and count with python?

I am grouping and counting a set of data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A'],
                   'data': np.ones(3,)})
df.groupby('key').count()
outputs
data
key
A 2
B 1
The code above works, but I wonder if there is a simpler one. The 'data': np.ones(3,) column seems to be a mere placeholder, yet it is indispensable:
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs an empty dataframe, since there is no value column left to count:
Empty DataFrame
Columns: []
Index: [A, B]
My question is: is there a simpler way to produce the counts of 'A' and 'B' without something like 'data': np.ones(3,)?
It doesn't have to be a pandas method, numpy or python native function are also appreciated.
Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64
Use a defaultdict:
from collections import defaultdict

data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
    d[element] += 1
d  # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})
There isn't any grouping here, just counting, so you can use collections.Counter:
from collections import Counter
Counter(['A', 'B', 'A'])
# Counter({'A': 2, 'B': 1})
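Since the question also welcomes numpy, np.unique with return_counts=True is another one-liner; it returns the sorted unique values alongside their counts:

```python
import numpy as np

data = ['A', 'B', 'A']
labels, counts = np.unique(data, return_counts=True)
print(labels, counts)  # ['A' 'B'] [2 1]
```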

Cartesian products of lists without duplicates

Given an array a = ['a', 'b', 'c'], how would you go about returning the Cartesian product of the array without duplicates? Example:
[['a', 'a' , 'a' ,'a']
['a', 'a' , 'a' ,'b']
['a', 'a' , 'a' ,'c']
['a', 'a' , 'b' ,'b']
['a', 'a' , 'b' ,'c']
['a', 'a' , 'c' ,'c']
...etc..]
Following How to generate all permutations of a list in Python, I tried:
import itertools
print list(itertools.permutations(['a', 'b', 'c'], 4))
[]
print list(itertools.product(['a', 'b', 'c'], repeat=4))
But I get the Cartesian product with duplicates. For example, the list will contain both ('a', 'a', 'b', 'b') and ('a', 'b', 'b', 'a'), which are clearly equal for my purposes.
Note: my 'a', 'b', 'c' are variables which store numbers, say 1, 2, 3. So after getting the list of combinations of the letters, I would need to do, say:
['a','b','c','c'] ----> a*b*c*c = 1*2*3*3 = 18
What is the fastest way of doing this in python? Would it be possible/faster to do it with numpy??
Thanks!
Maybe you actually want combinations_with_replacement?
>>> from itertools import combinations_with_replacement
>>> a = ['a', 'b', 'c']
>>> c = combinations_with_replacement(a, 4)
>>> for x in c:
... print x
...
('a', 'a', 'a', 'a')
('a', 'a', 'a', 'b')
('a', 'a', 'a', 'c')
('a', 'a', 'b', 'b')
('a', 'a', 'b', 'c')
('a', 'a', 'c', 'c')
('a', 'b', 'b', 'b')
('a', 'b', 'b', 'c')
('a', 'b', 'c', 'c')
('a', 'c', 'c', 'c')
('b', 'b', 'b', 'b')
('b', 'b', 'b', 'c')
('b', 'b', 'c', 'c')
('b', 'c', 'c', 'c')
('c', 'c', 'c', 'c')
Without more information about how you're mapping strings to numbers I can't comment on your second question, but writing your own product function or using numpy's isn't too difficult.
Edit: Don't use this; use the other answer
If your original set is guaranteed to be unique, then the `combinations_with_replacement` solution will work. If not, you can first pass it through `set()` to reduce it to unique variables. Regarding the product: assuming you have the values stored in a dictionary `values` and that all the variables are valid Python identifiers, you can do something like the following:
combos = combinations_with_replacement(a, 4)
product_strings = ['*'.join(c) for c in combos]
products = [eval(s, globals(), values) for s in product_strings]
Needless to say, be very careful with eval. Only use this solution if you are the one creating the list a.
Example exploit: a = ['from os import', '; system("rm -rf .");']
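A sketch that avoids eval entirely: keep the numeric values in a dict keyed by variable name and multiply with math.prod (Python 3.8+). The dict contents here are the example values from the question:

```python
from itertools import combinations_with_replacement
from math import prod

# Example mapping from variable names to their numeric values
values = {'a': 1, 'b': 2, 'c': 3}

# One product per unique multiset of 4 variables, e.g. a*b*c*c = 1*2*3*3 = 18
products = [prod(values[name] for name in combo)
            for combo in combinations_with_replacement(sorted(values), 4)]
print(len(products))  # 15 combinations, matching the listing above
```

No string building, no code execution, and it works with arbitrary (even hostile) variable names, since they are only ever used as dict keys.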
