For my quarters, instead of values such as 1,0,0,0 showing up, I get NaN.
How do I fix the code below so that the values appear in my dataframe?
import pandas as pd

qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']  # data_1 is an existing DataFrame with a 'Sales' column (not shown)
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a DataFrame that contains qrt_1, qrt_2, etc. under their corresponding column names.
Try using axis=1 in pd.concat. The default axis=0 stacks the frames vertically, so each frame gets NaN in every column it does not have; axis=1 concatenates them side by side on the shared index:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
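As a side note, since each dict contributes a single column of equal length, here is a minimal sketch (using the dicts defined above) that builds the same frame without the loop:
# Merge the column dicts into one mapping; the rows then line up
# automatically and no concat is needed.
df = pd.DataFrame({**year, **qrt_1, **qrt_2, **qrt_3, **qrt_4})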
I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the day of the week. Now I want to create this dataframe:
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
This depends on the DOW column: for example, if it is 1 then MON_FLAG will be 1, if it is 2 then TUE_FLAG will be 1, and so on. I have kept Sunday as 0; that's why all the flag columns are zero in that case.
Use get_dummies and rename the columns with a dictionary:
d = {0:'SUN_FLAG',1:'MON_FLAG',2:'TUE_FLAG',
3:'WED_FLAG',4:'THUR_FLAG',5: 'FRI_FLAG',6:'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print (df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0
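A small follow-up, assuming the single-column df from the question: the expected output there has no SUN_FLAG column (Sunday is represented by all flags being zero), and on pandas 2.x get_dummies returns boolean columns by default, so you may want to drop the Sunday indicator and force 0/1 integers:
# dtype=int keeps 0/1 instead of the bool default of newer pandas.
dummies = pd.get_dummies(df['DOW'], dtype=int).rename(columns=d)
df = df.join(dummies.drop(columns='SUN_FLAG'))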
I have a problem while transposing a Pandas DataFrame that has the following structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
foo 0 4 0 0 0 0 0 0 0 0 14 1 0 1 0 0 0
bar 0 6 0 0 4 0 5 0 0 0 0 0 0 0 1 0 0
lorem 1 3 0 0 0 1 0 0 2 0 3 0 1 2 1 1 0
ipsum 1 2 0 1 0 0 1 0 0 0 0 0 4 0 6 0 0
dolor 1 2 4 0 1 0 0 0 0 0 2 0 0 1 0 0 2
..
With index:
foo,bar,lorem,ipsum,dolor,...
This is basically a term-document matrix, where rows are terms and the column headers (0-16) are document indexes.
Since my purpose is clustering documents and not terms, I want to transpose the dataframe and use this to perform a cosine-distance computation between documents themselves.
But when I transpose with:
df = df.T
I get:
foo bar ... pippo lorem
0 0 0 ... 0 0
1 4 6 ... 0 0
2 0 0 ... 0 0
3 0 0 ... 0 0
4 0 4 ... 0 0
..
16 0 2 ... 0 1
With index:
0 , 1 , 2 , 3 , ... , 15, 16
What I would like:
I'm looking for a way to perform this operation while preserving the dataframe index; basically, the first row of my new df should be the original index.
Thank you
We can use a chain of unstack operations:
# Unstack to a Series indexed by (document, term), then pivot the term
# level back out into columns and drop the leftover helper level.
df2 = df.unstack().to_frame().unstack(1).droplevel(0, axis=1)
print(df2)
foo bar lorem ipsum dolor
0 0 0 1 1 1
1 4 6 3 2 2
2 0 0 0 0 4
3 0 0 0 1 0
4 0 4 0 0 1
5 0 0 1 0 0
6 0 5 0 1 0
7 0 0 0 0 0
8 0 0 2 0 0
9 0 0 0 0 0
10 14 0 3 0 2
11 1 0 0 0 0
12 0 0 1 4 0
13 1 0 2 0 1
14 0 1 1 6 0
15 0 0 1 0 0
16 0 0 0 0 2
Assuming the data is a square matrix (n x n), and if I understand the question correctly:
df = pd.DataFrame([[0, 4, 0], [0, 6, 0], [1, 3, 0]],
                  index=['foo', 'bar', 'lorem'],
                  columns=[0, 1, 2])
# Transpose the values but keep the original labels (only valid because
# the matrix is square).
df_T = pd.DataFrame(df.values.T, index=df.index, columns=df.columns)
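For a rectangular (non-square) frame the labels have to swap along with the values, which is what the built-in transpose does; a minimal sketch with the same df:
# Equivalent to df.T / df.transpose(): values transposed and the
# row/column labels exchanged.
df_T = pd.DataFrame(df.values.T, index=df.columns, columns=df.index)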
I have data where the column names are days (up to 3000 columns) with 0/1 values.
I would like to convert/group the columns into weeks (days 1-7 into week_1, days 8-14 into week_2, and so on):
if any of the columns 1-7 contains at least one 1, then week_1 should return 1, else 0.
Convert the first column to the index, then aggregate max using a helper array created by integer division by 7 plus 1:
import numpy as np
import pandas as pd

pd.options.display.max_columns = 30
np.random.seed(2020)

df = pd.DataFrame(np.random.choice([1,0], size=(5, 21), p=(0.1, 0.9)))
df.columns += 1
df.insert(0, 'id', 1000 + df.index)
print(df)
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 \
0 1000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1001 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1002 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
3 1003 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
4 1004 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 21
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
df = df.set_index('id')
# Week number for each column position: 0-6 -> 1, 7-13 -> 2, 14-20 -> 3.
arr = np.arange(len(df.columns)) // 7 + 1
df = df.groupby(arr, axis=1).max().add_prefix('week_').reset_index()
print(df)
id week_1 week_2 week_3
0 1000 0 0 0
1 1001 1 0 0
2 1002 1 1 0
3 1003 1 1 1
4 1004 1 0 0
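Note that DataFrame.groupby(..., axis=1) is deprecated in recent pandas (2.1+); here is a sketch of an equivalent that starts again from the original df and groups the rows of the transposed frame instead:
# Same weekly max without axis=1: group the transpose, then flip back.
df = df.set_index('id')
arr = np.arange(len(df.columns)) // 7 + 1
df = df.T.groupby(arr).max().T.add_prefix('week_').reset_index()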
import pandas as pd
import numpy as np

ids = list(range(1000, 1010))  # renamed from `id` to avoid shadowing the builtin
cols = list(range(1, 22))
data_ = np.random.randint(0, 2, size=(10, 21))  # 0/1 values, matching the question
client_data = pd.DataFrame(data=data_, index=ids, columns=cols)

def change_col(col_hd):
    # Map a day number to its week label: days 1-7 -> week_1, 8-14 -> week_2, ...
    week_num = (col_hd + 6) // 7
    return 'week_' + str(week_num)

new_col_header = []
for c in cols:
    new_col_header.append(change_col(c))

client_data.columns = new_col_header
client_data.columns.name = 'id'
client_data.groupby(axis='columns', level=0).sum()
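One caveat, assuming the client_data frame above: .sum() returns weekly event counts, while the question asks for a 0/1 flag (at least one 1 in the week), so .max() is likely the aggregation you want:
# 0/1 flag per week instead of a per-week count.
weekly_flags = client_data.groupby(axis='columns', level=0).max()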
I am trying to create a coincidence matrix between energy events measured by detectors in two channels. "Coincidence" is to say that the events occur within a user-specified timing window of each other. The data are currently stored in a pandas dataframe of the following format with fake sample data:
Energy Timestamp Channel
___________________________
6 103 1
7 70 2
4 110 2
8 205 2
2 219 1
3 333 1
5 300 1
9 350 2
I need the data in the following format such that, if a user were to select a timing window of 20, the resulting coincidence matrix would be:
Channel 1 Energy: 1 2 3 4 5 6 7 8 9 10
Channel 2 Energy:_________________________________________
1| 0 0 0 0 0 0 0 0 0 0
2| 0 0 0 0 0 0 0 0 0 0
3| 0 0 0 0 0 0 0 0 0 0
4| 0 0 0 0 0 1 0 0 0 0
5| 0 0 0 0 0 0 0 0 0 0
6| 0 0 0 0 0 0 0 0 0 0
7| 0 0 0 0 0 0 0 0 0 0
8| 0 1 0 0 0 0 0 0 0 0
9| 0 0 1 0 0 0 0 0 0 0
10| 0 0 0 0 0 0 0 0 0 0
Where now only the events that meet the condition
|Event1_Timestamp - Event2_Timestamp| < TimingWindow
(i.e., Event1_Timestamp < Event2_Timestamp + TimingWindow and Event1_Timestamp > Event2_Timestamp - TimingWindow)
are preserved in the coincidence matrix, and all noncoincident events are discarded.
I have tried:
df2 = df.merge(df, on="Timestamp")
df3 = pd.crosstab(df2.Energy_x, df2.Energy_y)
but there are a few problems with this output. It looks for exact matches in the timestamp rather than a timing window range, and it only lists the energies that appear, rather than a linearly spaced range of all possible energies (0-8192 energy bins). Any help is greatly appreciated.
Let's try using pd.merge_asof and pd.crosstab:
Where df is:
Energy Timestamp Channel
0 6 103 1
1 7 70 2
2 4 110 2
3 8 205 2
4 2 219 1
5 3 333 1
6 5 300 1
7 9 350 2
Then,
import numpy as np
import pandas as pd

# Pair each event with the closest earlier event within the window;
# allow_exact_matches=False keeps an event from matching itself.
df_out = pd.merge_asof(df.sort_values('Timestamp'),
                       df.sort_values('Timestamp'),
                       on='Timestamp',
                       allow_exact_matches=False,
                       tolerance=20)  # the timing window

# Tabulate the pairs and pad out to the full 1-10 energy range.
pd.crosstab(df_out['Energy_x'],
            df_out['Energy_y']).reindex(index=np.arange(1, 11),
                                        columns=np.arange(1, 11),
                                        fill_value=0)
Output:
Energy_y 1 2 3 4 5 6 7 8 9 10
Energy_x
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 1 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 1 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
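Two caveats. First, merge_asof with the default direction='backward' pairs each event only with the closest earlier event, so an event with several neighbors inside the window contributes a single pair. Second, the self-merge above does not distinguish channels, so same-channel coincidences could also appear. Here is a brute-force sketch that enforces the channel-1 vs channel-2 layout of the question's expected matrix, assuming the same df and a window of 20 (the cross join is O(n^2) in memory, so it only scales to moderate event counts):
import numpy as np
import pandas as pd

window = 20
ch1 = df[df['Channel'] == 1]
ch2 = df[df['Channel'] == 2]

# Cross join (pandas >= 1.2), then keep only pairs inside the window.
pairs = ch1.merge(ch2, how='cross', suffixes=('_1', '_2'))
pairs = pairs[(pairs['Timestamp_1'] - pairs['Timestamp_2']).abs() < window]

# Channel-2 energies as rows, channel-1 energies as columns, reindexed
# to the full energy range (1-10 here; 0-8192 in the real data).
matrix = pd.crosstab(pairs['Energy_2'], pairs['Energy_1']).reindex(
    index=np.arange(1, 11), columns=np.arange(1, 11), fill_value=0)
print(matrix)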