I'm gradually learning pandas. I've figured out that pd.crosstab() can do marvels, but I'm having a hard time making it work in this case.
I have a list of objects obj, each tagged with an int, and I want the matrix of objects sharing the same tag (1 if the tags match, 0 otherwise):
| obj | tag |
|-----|-----|
| a | 0 |
| b | 2 |
| c | 1 |
| ... | ... |
| z | 2 |
->
| | a | b | c | ... | z |
|-----|---|---|---|-----|---|
| a | 1 | 0 | 0 | . | 0 |
| b | 0 | 1 | 0 | . | 1 |
| c | 0 | 0 | 1 | . | 0 |
| ... | . | . | . | . | 0 |
| z | 0 | 1 | 0 | 0 | 1 |
There are various ways to do it, but is there a more pandas-friendly one?
PS: I tried pd.crosstab(df.obj, df.obj, values=df.tag, aggfunc=[np.sum]), but the result was filled with NaN.
You can use merge with crosstab and DataFrame.rename_axis:
df = df.merge(df, on='tag')
df = pd.crosstab(df.obj_x, df.obj_y).rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c z
a 1 0 0 0
b 0 1 0 1
c 0 0 1 0
z 0 1 0 1
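For a quick check, here is a minimal, self-contained sketch of the same approach (the sample frame below is a reconstruction from the question's table, not the asker's actual data):
import pandas as pd

# Rebuild a small version of the question's frame
df = pd.DataFrame({'obj': ['a', 'b', 'c', 'z'], 'tag': [0, 2, 1, 2]})

# The self-merge on tag pairs every object with the objects sharing its tag;
# crosstab then counts the pairs, giving 1 where tags match and 0 elsewhere
m = df.merge(df, on='tag')
out = pd.crosstab(m.obj_x, m.obj_y).rename_axis(None).rename_axis(None, axis=1)
print(out)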
I have a dataframe that looks like this
+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |
+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| code | 1 | 0 | 0 | 1 |
+--------+--------+--------+--------+--------+
For each 1 in the index row 'code', I would like the name of the corresponding column in a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583|key_name|
+--------+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 | Q111 |
| 2 | 0 | 1 | 1 | 0 | nan |
| 3 | 0 | 0 | 0 | 1 | nan |
| 4 | 1 | 0 | 0 | 1 | Info583|
| code | 1 | 0 | 0 | 1 | nan |
+--------+--------+--------+--------+--------+--------+
Thanks for any help or advice!
I think this is what you're looking for:
import numpy as np

df['key_name'] = np.nan  # start with an all-NaN column
condition = df.loc['code', :] == 1
df.loc[condition, 'key_name'] = df.columns[condition]
First create the column filled with NaNs. Then compute the condition: the row with index 'code' equals 1. Then plug in the column names where the condition is met.
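To make the first two steps concrete, here is a minimal sketch; the frame below is a hypothetical reconstruction of the question's table, and which column name should win when a row matches several flagged columns is left open, as in the question:
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frame
df = pd.DataFrame(
    {'Q111': [1, 0, 0, 1], 'Q570': [0, 1, 0, 0],
     'Q7891': [0, 1, 0, 0], 'Info583': [0, 0, 1, 1]},
    index=[1, 2, 3, 'code'],
)

df['key_name'] = np.nan                 # start with an all-NaN column
condition = df.loc['code', :] == 1      # columns flagged with a 1 in the 'code' row
print(df.columns[condition])            # Index(['Q111', 'Info583'], dtype='object')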
I have two datasets, dataset1 and dataset2, which share a common column called SAX (a string object).
dataset1=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2 =
SAX timestamp
0 hssrlcu 16015
1 ktyuymp 16016
2 xncqmfr 16017
3 aanlmna 16018
4 urvahvo 16019
... ... ...
263455 jeivqzo 279470
263456 bzasxgw 279471
263457 jspqnqv 279472
263458 sxwfchj 279473
263459 gxqnhfr 279474
263460 rows × 2 columns
I need to find and print the timestamps for every value in the SAX column of dataset1 that also exists in the SAX column of dataset2.
Is there a function/method for accomplishing the above?
Thanks.
Let's create an arbitrary dataset to showcase how it works:
import pandas as pd
import numpy as np
def sax_generator(num):
    return [''.join(chr(x) for x in np.random.randint(97, 97 + 26, size=4)) for _ in range(num)]
df1 = pd.DataFrame(sax_generator(10), columns=['sax'])
df2 = pd.DataFrame({'sax': sax_generator(10), 'timestamp': range(10)})
Let's peek into the data:
df1 =
| | sax |
|---:|:------|
| 0 | cvtj |
| 1 | fmjy |
| 2 | rjpi |
| 3 | gwtv |
| 4 | qhov |
| 5 | uriu |
| 6 | kpku |
| 7 | xkop |
| 8 | kzoe |
| 9 | nydj |
df2 =
| | sax | timestamp |
|---:|:------|------------:|
| 0 | kzoe | 0 |
| 1 | npyo | 1 |
| 2 | uriu | 2 |
| 3 | hodu | 3 |
| 4 | rdko | 4 |
| 5 | pspn | 5 |
| 6 | qnut | 6 |
| 7 | gtyz | 7 |
| 8 | gfzs | 8 |
| 9 | gcel | 9 |
Now ensure we have some matching values in df2 from df1, which we can later check:
df2.loc[2, 'sax'] = df1.loc[5, 'sax']
df2.loc[0, 'sax'] = df1.loc[8, 'sax']
Then use:
df2.loc[df1.sax.apply(lambda x: df2.sax.str.contains(x)).any(), 'timestamp']
to get:
| | timestamp |
|---:|------------:|
| 0 | 0 |
| 2 | 2 |
With np.where you can get the indices back as well:
np.where(df1.sax.apply(lambda x: df2.sax.str.contains(x)) == True)
# -> (array([5, 8]), array([2, 0]))
Here we can see that df1 has matching indices [5, 8] and df2 has [2, 0], which is exactly what we enforced with the lines above...
If we have a look at the return of df1.sax.apply(lambda x: df2.sax.str.contains(x)), the result above matches exactly the indices (magic...whooo):
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Step 1: Convert dataset2 to a dict using:
import pandas as pd

# Map each timestamp in dataset2 to its SAX string
a_dictionary = dataset2.set_index('timestamp')['SAX'].to_dict()
Step 2: Use a comparison in a for loop to extract the timestamps.
lookup_value = "abcdef"  # This can be a list item.
all_keys = []
for key, value in a_dictionary.items():
    if value == lookup_value:
        all_keys.append(key)
print(all_keys)
Step 3: Enjoy!
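As a side note, a more vectorized sketch (assuming the two frames are loaded as dataset1 and dataset2, as in the question) relies on Series.isin:
import pandas as pd

# Boolean mask over dataset2: True where its SAX value also appears in dataset1
mask = dataset2['SAX'].isin(dataset1['SAX'])
matching_timestamps = dataset2.loc[mask, 'timestamp']
print(matching_timestamps.tolist())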
I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.
In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).
Here is a sample of the original data:
| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219 | E119 | I48 | 1 |
| I500 | | | 0 |
| I48 | I500 | F171 | 1 |
| I219 | E119 | I500 | 0 |
| I219 | I48 | | 0 |
Here's what I tried:
import pandas as pd
df = pd.read_csv('micro.csv', dtype='object')
df['cond1'] = pd.Categorical(df['cond1'])
df['cond2'] = pd.Categorical(df['cond2'])
df['cond3'] = pd.Categorical(df['cond3'])
dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix = 'cond')
dummies
Which gives me:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
So I have multiple coded columns for any code that appears in more than one column (I48 and I500). I would like only a single column for each, so I can check for correlations between individual codes and my target variable.
Is there a way to do this? This is the result I'm after:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 |
Take the max if you need 1/0 data in the output:
dfDummies = dummies.max(axis=1, level=0)
Or use sum if you need to count the 1 values:
dfDummies = dummies.sum(axis=1, level=0)
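Note that in recent pandas releases the level argument of DataFrame.max/DataFrame.sum has been removed; a rough equivalent (a sketch, not tied to a specific version) groups the duplicated column labels instead:
# Group the duplicated column labels and reduce across them
dfDummies = dummies.T.groupby(level=0).max().T   # or .sum() to count the 1s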
I have a table that looks something like this:
+------------+------------+------------+------------+
| Category_1 | Category_2 | Category_3 | Category_4 |
+------------+------------+------------+------------+
| a | b | b | y |
| a | a | c | y |
| c | c | c | n |
| b | b | c | n |
| a | a | a | y |
+------------+------------+------------+------------+
I'm hoping for a pivot_table like result, with the counts of the frequency for each category. Something like this:
+---+------------+----+----+----+
| | | a | b | c |
+---+------------+----+----+----+
| | Category_1 | 12 | 10 | 40 |
| y | Category_2 | 15 | 48 | 26 |
| | Category_3 | 10 | 2 | 4 |
| | Category_1 | 5 | 6 | 4 |
| n | Category_2 | 9 | 5 | 2 |
| | Category_3 | 8 | 4 | 3 |
+---+------------+----+----+----+
I know I could pull it off by splitting the table, assigning value_counts to column values, then rejoining. Is there a simpler, more 'pythonic' way of pulling this off? I figure it may be along the lines of a pivot paired with a transform, but my tests so far have been ugly at best.
We need to melt (or stack) your original dataframe, then do pd.crosstab; you can use pd.pivot_table as well (a sketch follows the output below).
s=df.set_index('Category_4').stack().reset_index().rename(columns={0:'value'})
pd.crosstab([s.Category_4,s.level_1],s['value'])
Out[532]:
value a b c
Category_4 level_1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
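As mentioned, pd.pivot_table can produce the same table; a sketch using the stacked frame s from the snippet above (counting rows per group):
s.pivot_table(index=['Category_4', 'level_1'], columns='value',
              aggfunc='size', fill_value=0)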
Using get_dummies first, then summing across index levels
d = pd.get_dummies(df.set_index('Category_4'))
d.columns = d.columns.str.rsplit('_', 1, True)
d = d.stack(0)
# This shouldn't be necessary but is because the
# index gets bugged and I'm "resetting" it
d.index = pd.MultiIndex.from_tuples(d.index.values)
d.sum(level=[0, 1])
a b c
y Category_1 3 0 0
Category_2 2 1 0
Category_3 1 1 1
n Category_1 0 1 1
Category_2 0 1 1
Category_3 0 0 2
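In newer pandas versions the positional arguments to str.rsplit and the level argument of sum have been removed; a roughly equivalent sketch (sample data rebuilt from the question's table):
import pandas as pd

# Rebuild the question's sample frame
df = pd.DataFrame({
    'Category_1': ['a', 'a', 'c', 'b', 'a'],
    'Category_2': ['b', 'a', 'c', 'b', 'a'],
    'Category_3': ['b', 'c', 'c', 'c', 'a'],
    'Category_4': ['y', 'y', 'n', 'n', 'y'],
})

d = pd.get_dummies(df.set_index('Category_4'))
d.columns = d.columns.str.rsplit('_', n=1, expand=True)  # keyword arguments are required now
d = d.stack(0)
d.index = pd.MultiIndex.from_tuples(d.index.values)
print(d.groupby(level=[0, 1]).sum())                     # replaces d.sum(level=[0, 1])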
I would like to find all cases for all ids in a Pandas DataFrame.
What would be an efficient solution? I have around 10k records and the processing is done server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied when an id contains all names in that case.
Input (Pandas DataFrame)
id | name |
-----------
1 | bla1 |
2 | bla2 |
2 | bla3 |
2 | bla4 |
3 | bla5 |
4 | bla9 |
5 | bla6 |
5 | bla7 |
6 | bla8 |
Cases
names [
[bla2, bla3, bla4], #case 1
[bla1, bla3, bla7], #case 2
[bla3, bla1, bla6], #case 3
[bla6, bla7] #case 4
]
Needed output (unless there is a more efficient way)
id | case1 | case2 | case3 | case4 |
------------------------------------
1 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 |
names = [
['bla2', 'bla3', 'bla4'], # case 1
['bla1', 'bla3', 'bla7'], # case 2
['bla3', 'bla1', 'bla6'], # case 3
['bla6', 'bla7'] # case 4
]
df = df.groupby('id').apply(lambda x: \
pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))\
.rename(columns=lambda x: 'case{}'.format(x + 1))
df
+------+---------+---------+---------+---------+
| id | case1 | case2 | case3 | case4 |
|------+---------+---------+---------+---------|
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 |
+------+---------+---------+---------+---------+
First, group by id, then apply a check for each case, for each group. The objective is to check whether all items of a given case appear in the group's names, which is handled by isin in conjunction with the list comprehension. The outer pd.Series expands the result into separate columns, and rename is used to rename the columns.
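For completeness, here is a minimal self-contained sketch of the above, with the input rebuilt from the question's table:
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    'name': ['bla1', 'bla2', 'bla3', 'bla4', 'bla5',
             'bla9', 'bla6', 'bla7', 'bla8'],
})

names = [
    ['bla2', 'bla3', 'bla4'],  # case 1
    ['bla1', 'bla3', 'bla7'],  # case 2
    ['bla3', 'bla1', 'bla6'],  # case 3
    ['bla6', 'bla7'],          # case 4
]

# For every id, check whether each case's names are all present in that group
out = (df.groupby('id')
         .apply(lambda x: pd.Series([int(pd.Series(y).isin(x['name']).all())
                                     for y in names]))
         .rename(columns=lambda i: 'case{}'.format(i + 1)))
print(out)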