I'm trying to get the value of a specific cell.
  main_id   name  code
0    1345  Jones    32
1    1543   Jack    62
2    9874   Buck    86
3    2456   Slim    94
I want the cell that says code=94, as I already know the main_id but nothing else.
import pandas as pd

raw_data = {'main_id': ['1345', '1543', '9874', '2456'],
            'name': ['Jones', 'Jack', 'Buck', 'Slim'],
            'code': [32, 62, 86, 94]}
df = pd.DataFrame(raw_data, columns=['main_id', 'name', 'code'])
v=df.loc[str(df['main_id']) == str(2456)]['code'].values
print(df.loc['name'])
The print(df.loc['name']) call fails, claiming the label is not in the index,
and the v = df.loc[str(df['main_id']) == str(2456)]['code'].values line raises KeyError: False.
df.loc['name'] raises a KeyError because 'name' is not in the index; it is a column. The first argument to loc selects rows by index label, so use df['name'] or df.loc[:, 'name'] instead. The second error comes from str(df['main_id']) == str(2456): that compares the string representation of the whole Series with the string '2456', which evaluates to the single boolean False, so loc then looks for a row labelled False and raises KeyError: False.
You can also pass boolean arrays to loc (both for rows and columns). For example,
df.loc[df['main_id']=='2456']
Out:
  main_id  name  code
3    2456  Slim    94
You can still select a particular column for this, too:
df.loc[df['main_id']=='2456', 'code']
Out:
3 94
Name: code, dtype: int64
With boolean indexing, the returned object is always a Series, even if it contains only one value. So you might want to access the underlying array and take the first element:
df.loc[df['main_id']=='2456', 'code'].values[0]
Out:
94
A better way, though, is to use the item method:
df.loc[df['main_id']=='2456', 'code'].item()
Out:
94
This way, you'll get an error if the length of the returned Series is greater than 1, whereas values[0] does not check for that.
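For example, if a second row happened to share the same main_id (a hypothetical duplicate, just to show the difference):
dup = pd.DataFrame({'main_id': ['2456', '2456'], 'code': [94, 95]})
dup.loc[dup['main_id'] == '2456', 'code'].values[0]   # 94 -- silently returns the first match
dup.loc[dup['main_id'] == '2456', 'code'].item()      # raises ValueError, alerting you to the duplicate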
Alternative solution:
In [76]: df.set_index('main_id').at['2456','code']
Out[76]: 94
Related
I have a dataframe where one of the columns has some rows containing an array value instead of a single int64 value. I want to drop all such rows.
I am using the following code to do this, but it is not working (for obvious reasons: an array is being compared to a string).
handover_data.drop(handover_data[handover_data['S-PCI'] == '[105 106]'].index, inplace=True)
In the dataframe a cell should contain either 105 or 106, not both, but some of the places have [105 106].
What is a good way to check whether a cell holds an array instead of the expected single value?
The data set looks like the following:
S-Cell ID N-Cell ID S-PLMN S-PCI N-PCI S-BW N-BW \
73 257 0 105 105 106 2147483647 2147483647
S-EARFCN N-EARFCN
73 3025 3025 30102
Elapsed RT Time (ms) RSRP-105 RSRP-106 RSRQ-105 RSRQ-106
73 41846000000 2947094.0 -84 -90 -4 -14
EDIT:
s_Cell = 105
for i, j in hd_data.iterrows():
    if(hd_data.at[i,'S-PCI'].all() != s_Cell):
        hd_data.at[i,'H_Event'] = 1
This is failing with the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-399-d4c47a34a73e> in <module>
19 print(i)
20 print(handover_data.at[i,'S-PCI'])
---> 21 if(handover_data.at[i,'S-PCI'].all() != starting_Cell):
22 handover_data.at[i,'Handover_Event'] = 1
23 #handover_data.at[i,'Time_to_Handover'] = handover_data.at[i,'TimeInterval']-last_HO_time
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
You can filter by the values you want, like:
wanted_values = [105, 106]
handover_data = handover_data[handover_data['S-PCI'].isin(wanted_values)]
If you want to remove only the rows whose value is actually an array/list (rather than whitelisting the values you expect), you can build a boolean mask; it's a bit more resource intensive since every cell gets checked:
import numpy as np

# keep only the rows whose 'S-PCI' cell is not a list/array
mask = handover_data['S-PCI'].apply(lambda x: isinstance(x, (list, np.ndarray)))
handover_data = handover_data[~mask]
Another option: drop any row whose value doesn't stringify to a plain number (an array's string form contains brackets and spaces, so isdigit() returns False):
handover_data.drop(handover_data[handover_data['S-PCI'].apply(lambda x: not str(x).isdigit())].index, inplace=True)
I have a Pandas dataframe with columns labeled Ticks, Water, and Temp, with a few million rows (possibly billions in a complete dataset), but it looks something like this:
...
'Ticks' 'Water' 'Temp'
215 4 26.2023
216 1 26.7324
217 17 26.8173
218 2 26.9912
219 48 27.0111
220 1 27.2604
221 19 27.7563
222 32 28.3002
...
(All temperatures are in ascending order, and all 'ticks' are also linearly spaced and in ascending order too)
What I'm trying to do is reduce the data to one row per floored, integer 'Temp' value, keeping the summed 'Water' values and just the first 'Tick' value (or the last; it doesn't have much of an effect on the analysis).
My current approach is to iterate row by row: save the tick value, sum the water values while the floored temperature stays the same, and whenever the temperature crosses the next whole integer, append the saved tick, the integer temperature, and the summed water count to a new dataframe.
I'm sure this will work, but I'm thinking there should be a much more efficient way to do this with some application of df.loc or df.iloc, since everything is nicely in ascending order.
My hopeful output for this would be a much shorter dataset with values that look something like this:
...
'Ticks' 'Water' 'Temp'
215 24 26
219 68 27
222 62 28
...
Use GroupBy.agg and Series.astype
new_df = (df.groupby(df['Temp'].astype(int))
            .agg({'Ticks': 'first', 'Water': 'sum'})
            #.agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns)
          )
print(new_df)
Output
Ticks Water Temp
0 215 24 26
1 219 68 27
2 222 32 28
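The commented-out line is the equivalent named-aggregation form, available in pandas 0.25+; both produce the same result here.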
I have some trouble understanding the rules for which ticks you want in the final dataframe, but here is a way to get the indices of all Temps with equal floored value:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
    'Water': [4, 1, 17, 2, 48, 1, 19, 32],
    'Temp': [26.2023, 26.7324, 26.8173, 26.9912, 27.0111, 27.2604, 27.7563, 28.3002]})
# first floor all temps
data['Temp'] = data['Temp'].apply(np.floor)
# get the indices of all equal temps
groups = data.groupby('Temp').groups
print(groups)
# maybe apply mean?
data = data.groupby('Temp').mean()
print(data)
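If instead you want the first tick and the summed water per floored temperature (as in the expected output), you can aggregate each column differently; a small variant, used in place of the .mean() line above:
# instead of .mean(): first tick and summed water per floored Temp
data = data.groupby('Temp', as_index=False).agg({'Ticks': 'first', 'Water': 'sum'})
print(data)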
hope this helps
I have many repeating index values in a certain sequence.
I want to take the first value of my index column 'ID' and, whenever that value repeats, move the next set of values into a new column. Here my index values repeat 4 times, so the output will have 4 columns labeled i = 1, 2, 3, 4 ... up to N for N repeated index sets.
The starting value of the index column will differ for other data sets, but the values repeat in the same sequence.
Sample Dataset: df
ID,1
7,0.060896109
10,0.384263675
27,0.780060081
43,0.583200572
57,0.139564176
73,0.595220898
91,0.828783841
7,0.39920022
10,0.157306146
27,0.29750421
43,0.742234942
57,0.971849921
73,0.346905033
91,0.996723279
7,0.192197827
10,0.922323942
27,0.033304593
43,0.462253505
57,0.282632609
73,0.553047118
91,0.07678817
7,0.428707324
10,0.250935035
27,0.529861617
43,0.982468147
57,0.473807591
73,0.340980584
91,0.436675534
Expected Output Sample:
ID,1,2,3,4
7,0.060896109,0.39920022,0.192197827,0.428707324
10,0.384263675,0.157306146,0.922323942,0.250935035
27,0.780060081,0.29750421,0.033304593,0.529861617
43,0.583200572,0.742234942,0.462253505,0.982468147
57,0.139564176,0.971849921,0.282632609,0.473807591
73,0.595220898,0.346905033,0.553047118,0.340980584
91,0.828783841,0.996723279,0.07678817,0.436675534
Use DataFrame.pivot with a helper column, created by comparing the index with its first value and taking the cumulative sum:
df = df.assign(g=np.cumsum(df.index == df.index[0])).pivot(columns='g',values='1')
print (df)
g 1 2 3 4
ID
7 0.060896 0.399200 0.192198 0.428707
10 0.384264 0.157306 0.922324 0.250935
27 0.780060 0.297504 0.033305 0.529862
43 0.583201 0.742235 0.462254 0.982468
57 0.139564 0.971850 0.282633 0.473808
73 0.595221 0.346905 0.553047 0.340981
91 0.828784 0.996723 0.076788 0.436676
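Note that this assumes ID is already the index and numpy is imported; for example, if the sample is read from CSV (filename hypothetical):
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', index_col='ID')   # column '1' stays a regular column
df = df.assign(g=np.cumsum(df.index == df.index[0])).pivot(columns='g', values='1')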
You can use pd.pivot_table with a little extra work on the columns:
pd.pivot_table(data=df, index='ID', columns=df.groupby('ID').cumcount(), values='1')
0 1 2 3
ID
7 0.060896 0.399200 0.192198 0.428707
10 0.384264 0.157306 0.922324 0.250935
27 0.780060 0.297504 0.033305 0.529862
43 0.583201 0.742235 0.462254 0.982468
57 0.139564 0.971850 0.282633 0.473808
73 0.595221 0.346905 0.553047 0.340981
91 0.828784 0.996723 0.076788 0.436676
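If you want the columns labeled 1..N as in the expected output, you can shift the counter by one:
pd.pivot_table(data=df, index='ID', columns=df.groupby('ID').cumcount() + 1, values='1')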
I've created a new column in a DataFrame that contains the categorical feature 'QD', which describes in which "decile" (the lowest 10%, 20%, 30%, ... of values) the value of another feature of the DataFrame falls. You can see the DF head below:
EPS CPI POC Vendeu Delta QD
1 20692 1 19185.30336 0 -1506.69664 QD07
8 20933 1 20433.27115 0 -499.72885 QD08
10 20393 1 20808.04948 0 415.04948 QD10
18 20503 1 19153.45978 0 -1349.54022 QD07
19 20587 1 20175.31906 1 -411.68094 QD09
Data Frame Head
The 'QD' column was created through the function below:
minimo = DF['EPS'].min()
passo = (DF['EPS'].max() - DF['EPS'].min()) / 10

def get_q(value):
    for i in range(1, 11):
        if value < (minimo + (i * passo)):
            return str('QD' + str(i).zfill(2))
Function applied on 'Delta'
Analyzing this column, I noticed something strange:
AUX2['QD'].unique()
out:
array(['QD07', 'QD08', 'QD10', 'QD09', 'QD06', 'QD05', 'QD04', 'QD03',
'QD02', 'QD01', None], dtype=object)
'QD' unique values
The .unique() method returns an array with a None value in it. At first I thought there was something wrong with the function, but when I tried to grab the rows holding the None value, look:
AUX2['QD'].value_counts()
out:
QD05 852
QD04 848
QD06 685
QD03 578
QD07 540
QD08 377
QD02 318
QD09 209
QD10 68
QD01 61
Name: QD, dtype: int64
.value_counts()
len(AUX2[AUX2['QD'] == None]['QD'])
out:
0
len()
What am I missing here?
When you use .value_counts(), add dropna=False so that missing values (None/NaN) are counted as well. And you can't find those rows with == None, because equality comparison with None/NaN always yields False in pandas; use isnull() instead:
df[df['QD'].isnull()]
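A minimal sketch of the behaviour (column values are just for illustration):
import pandas as pd

s = pd.Series(['QD01', 'QD02', None])
print(s.value_counts())               # the None is silently dropped
print(s.value_counts(dropna=False))   # the None gets its own count
print(s[s.isnull()])                  # locates the missing entries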
I have a dataframe which has several columns. I want to extract rows by a combination of values from two specific columns, so I use the set_index() method to index the dataframe by those columns. I figured that after doing so, I would have direct (O(1)) access to rows for a given combination of keys. That does not seem to be the case; a df.ix[ix1, ix2] operation takes quite some time.
Example:
Say I have the following dataframe:
In [228]: df
Out[228]:
ID1 ID2 score
752476 5626887150_0 5626887150_6 96
752477 5626887150_0 5626887150_7 95
752478 5626887150_0 5626887150_2 95
752479 5626887150_0 5626887150_8 93
752480 5626887150_0 5626887150_1 89
752481 5626887150_0 2142280814_5 88
752482 5626887150_0 5626887150_3 84
752483 5626887150_0 6625625104_5 82
752484 5626887150_0 2142280814_4 81
And say I want to look at the score column for different ID1, ID2 combinations. To do this easily, I'm setting ID1 and ID2 as indexes, and I obtain the following result:
In [230]: df = df.set_index(['ID1','ID2'])
Out[230]:
score
ID1 ID2
5626887150_0 5626887150_6 96
5626887150_7 95
5626887150_2 95
5626887150_8 93
5626887150_1 89
2142280814_5 88
5626887150_3 84
6625625104_5 82
2142280814_4 81
Now I can easily access my data with ID1, ID2 combinations (e.g. df.ix['5626887150_0','5626887150_6']), that's true. BUT it does not seem to be O(1) access; it takes quite some time to return a value on a large dataframe.
So what exactly is the set_index() method doing, and is there a way to force O(1) access to the data?
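Not a full answer, but one thing worth checking: set_index() builds a MultiIndex but does not sort it, and label lookups on an unsorted MultiIndex can degrade to a scan (pandas warns about indexing past the "lexsort depth"). Sorting the index, and using .loc rather than the deprecated .ix, usually makes repeated lookups much faster; a minimal sketch:
df = df.set_index(['ID1', 'ID2']).sort_index()              # sorted index -> fast lookups
print(df.loc[('5626887150_0', '5626887150_6'), 'score'])    # 96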