I have a dataframe that looks like this
+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583 |
+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| code | 1 | 0 | 0 | 1 |
+--------+--------+--------+--------+--------+
For each 1 in the row with index 'code', I would like to put the name of the corresponding column into a new column 'key_name'. Here is the desired final result:
+--------+--------+--------+--------+--------+--------+
| index | Q111 | Q570 | Q7891 |Info583|key_name|
+--------+--------+--------+--------+--------+--------+
| 1 | 1 | 0 | 0 | 0 | Q111 |
| 2 | 0 | 1 | 1 | 0 | nan |
| 3 | 0 | 0 | 0 | 1 | nan |
| 4 | 1 | 0 | 0 | 1 | Info583|
| code | 1 | 0 | 0 | 1 | nan |
+--------+--------+--------+--------+--------+--------+
Thanks for any help or advice!
I think this is what you're looking for:
df['key_name'] = np.nan
condition = df.loc['code', :] == 1
df.loc[condition, 'key_name'] = df.columns[condition]
First make the column with just NaNs. Then compute your condition: the row with index 'code' equals 1. Then plug in the column names where the condition is met.
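Here is a minimal runnable sketch of the same idea on the example frame, assuming one reading of the rule (each row gets the name of a column that is 1 both in that row and in the 'code' row; when several columns qualify, the last one is kept):
import numpy as np
import pandas as pd

# rebuild the example frame from the question (row 4 taken from the desired output)
df = pd.DataFrame(
    {'Q111':    [1, 0, 0, 1, 1],
     'Q570':    [0, 1, 0, 0, 0],
     'Q7891':   [0, 1, 0, 0, 0],
     'Info583': [0, 0, 1, 1, 1]},
    index=[1, 2, 3, 4, 'code'])

flagged = [c for c in df.columns if df.at['code', c] == 1]  # columns the 'code' row marks with 1

def pick(row):
    hits = [c for c in flagged if row[c] == 1]  # flagged columns this row also marks
    return hits[-1] if hits else np.nan         # assumption: keep the last match

df['key_name'] = df.apply(pick, axis=1)
df.loc['code', 'key_name'] = np.nan             # the 'code' row itself stays NaN
print(df)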
I have a dataframe. I need to drop the duplicated ticket_id values when the owner_type is the same, and when it differs, pick 'm' over 's'; wherever a value is not picked, NaN should be returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | 4 |
Pseudo code would be: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the value for 'm' and NaN for 's'.
My attempt
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else NaN)
Not working
Try this:
(df['ticket_id'].where(
    ~df.duplicated(['owner_type', 'ticket_id'], keep=False) &                     # row is not an exact (owner_type, ticket_id) duplicate
    df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min'))   # and its owner_type is the group's minimum ('m' beats 's')
))
Old answer:
m = ~df.duplicated(keep=False) & df['owner_type'].eq('m')
df['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
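For completeness, a small sketch applying that first expression to the question's data frame and writing the masked result back:
import pandas as pd

data = pd.DataFrame({'owner_type': ['m', 'm', 'm', 's', 's', 'm', 's', 's'],
                     'ticket_id':  [1, 1, 2, 2, 3, 3, 4, 4]})

keep = (~data.duplicated(['owner_type', 'ticket_id'], keep=False)
        & data['owner_type'].eq(data.groupby('ticket_id')['owner_type'].transform('min')))

data['ticket_id'] = data['ticket_id'].where(keep)  # rows not kept become NaN
print(data)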
I have two datasets, dataset1 and dataset2 (image link provided), which share a common column called SAX that holds string objects.
dataset1=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2 =
SAX timestamp
0 hssrlcu 16015
1 ktyuymp 16016
2 xncqmfr 16017
3 aanlmna 16018
4 urvahvo 16019
... ... ...
263455 jeivqzo 279470
263456 bzasxgw 279471
263457 jspqnqv 279472
263458 sxwfchj 279473
263459 gxqnhfr 279474
263460 rows × 2 columns
I need to find and print out the timestamps whenever a value in the SAX column of dataset1 exists in the SAX column of dataset2.
Is there a function/method for accomplishing the above?
Thanks.
Let's create an arbitrary dataset to showcase how it works:
import pandas as pd
import numpy as np
def sax_generator(num):
    # generate `num` random 4-letter lowercase strings
    return [''.join(chr(x) for x in np.random.randint(97, 97 + 26, size=4)) for _ in range(num)]
df1 = pd.DataFrame(sax_generator(10), columns=['sax'])
df2 = pd.DataFrame({'sax': sax_generator(10), 'timestamp': range(10)})
Let's peek into the data:
df1 =
| | sax |
|---:|:------|
| 0 | cvtj |
| 1 | fmjy |
| 2 | rjpi |
| 3 | gwtv |
| 4 | qhov |
| 5 | uriu |
| 6 | kpku |
| 7 | xkop |
| 8 | kzoe |
| 9 | nydj |
df2 =
| | sax | timestamp |
|---:|:------|------------:|
| 0 | kzoe | 0 |
| 1 | npyo | 1 |
| 2 | uriu | 2 |
| 3 | hodu | 3 |
| 4 | rdko | 4 |
| 5 | pspn | 5 |
| 6 | qnut | 6 |
| 7 | gtyz | 7 |
| 8 | gfzs | 8 |
| 9 | gcel | 9 |
Now ensure we have some matching values in df2 from df1, which we can later check:
df2.loc[2, 'sax'] = df1.loc[5, 'sax']
df2.loc[0, 'sax'] = df1.loc[8, 'sax']
Then use:
df2.loc[df1.sax.apply(lambda x: df2.sax.str.contains(x)).any(), 'timestamp']
to get:
| | timestamp |
|---:|------------:|
| 0 | 0 |
| 2 | 2 |
With np.where you can get the indices back as well:
np.where(df1.sax.apply(lambda x: df2.sax.str.contains(x)) == True)
# -> (array([5, 8]), array([2, 0]))
Here we can see that df1 has matching indices [5, 8] and df2 has [2, 0], which is exactly what we enforced with the lines above...
If we have a look at the return of df1.sax.apply(lambda x: df2.sax.str.contains(x)), the result above matches exactly the indices (magic...whooo):
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
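As an aside, if the SAX values only need to match exactly rather than as substrings, an isin lookup is a simpler sketch; this assumes the real columns are named SAX and timestamp as in the question:
matches = dataset2.loc[dataset2['SAX'].isin(dataset1['SAX']), 'timestamp']
print(matches)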
Step1: Convert dataset2 to a dict that maps each timestamp to its SAX value:
import pandas as pd

a_dictionary = dataset2.set_index('timestamp')['SAX'].to_dict()
Step2: Use a comparator in a for loop to extract the timestamps:
lookup_value = "abcdef"  # this can also be an item taken from dataset1['SAX']
all_keys = []
for key, value in a_dictionary.items():
    if value == lookup_value:
        all_keys.append(key)
print(all_keys)
Step3: ENJOY!
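If you want that lookup for every SAX value in dataset1 rather than a single string, a rough sketch of the same loop (quadratic, so only for small data):
all_keys = []
for lookup_value in dataset1['SAX']:
    for key, value in a_dictionary.items():
        if value == lookup_value:
            all_keys.append(key)  # collect the matching timestamps
print(all_keys)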
I have a dataset similar to the this sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
data=[[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
[6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
[9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data= pd.DataFrame(data,columns=['id','old_a','old_b','new_a','new_b'])
For each id, I want to look into columns 'new_a' and 'new_b': if even a single non-zero value exists in a column for that id, I want to count it as 1, irrespective of how many times a value occurs, and assign 0 if no value is present. For example, if I look into id '9', there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13', there are no values in new_a, so I would want to assign it 0.
My final output should like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the % of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed with this.
Use GroupBy.any, because 0 is treated as False, and then convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentages, take the mean of the output above:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
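If only the percentages are needed, both steps can be chained; a minimal variant of the snippet above:
pct = data.groupby('id')[['new_a', 'new_b']].any().mean().mul(100)  # share of ids with any non-zero value
print(pct)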
I'm progressively learning pandas. I figured out that pd.crosstab() can do marvels, but I'm having a hard time making it work in this case.
I have a list of objects obj, each tagged with an int, and I want the matrix of objects sharing the same tag (1 if the tags are the same, 0 otherwise):
| obj | tag |
|-----|-----|
| a | 0 |
| b | 2 |
| c | 1 |
| ... | ... |
| z | 2 |
->
| | a | b | c | ... | z |
|-----|---|---|---|-----|---|
| a | 1 | 0 | 0 | . | 0 |
| b | 0 | 1 | 0 | . | 1 |
| c | 0 | 0 | 1 | . | 0 |
| ... | . | . | . | . | 0 |
| z | 0 | 1 | 0 | 0 | 1 |
There are some roundabout ways to do it; is there a more pandas-friendly one?
PS: I tried pd.crosstab(df.obj, df.obj, values=df.tag, aggfunc=[np.sum]) but it comes back filled with NaN.
You can use a self-merge on tag with crosstab and DataFrame.rename_axis:
df = df.merge(df, on='tag')  # pair every object with every object sharing its tag
df = pd.crosstab(df.obj_x, df.obj_y).rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c z
a 1 0 0 0
b 0 1 0 1
c 0 0 1 0
z 0 1 0 1
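An alternative sketch that builds the same 0/1 matrix with an outer comparison of the tags instead of a self-merge, using a small made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'obj': ['a', 'b', 'c', 'z'], 'tag': [0, 2, 1, 2]})

# compare every tag with every other tag, then cast the boolean matrix to 0/1
tags = df['tag'].to_numpy()
mat = pd.DataFrame(np.equal.outer(tags, tags).astype(int), index=df['obj'], columns=df['obj'])
print(mat)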