iloc Pandas Slice Difficulties - python

I've updated the information below to be a little clearer, as per the comments:
I have the following dataframe df (it has 38 columns; these are only the last few):
Col #  33       34       35       36       37   38
id     09.2018  10.2018  11.2018  12.2018  LTx  LTx2
123    0.505    0.505    0.505    0.505    33   35
223    2.462    2.464    0.0      30.0     33   36
323    1.231    1.231    1.231    1.231    33   35
423    0.859    0.855    0.850    0.847    33   36
I am trying to create a new column which is the sum of a slice using iloc. For example, for id 123 it would look like the following:
df['LTx3'] = (df.iloc[:, 33:35]).sum(axis=1)
This obviously works for 123, but not for 223. I had assumed this would work:
df['LTx3'] = (df.iloc[:, 'LTx':'LTx2']).sum(axis=1)
But I consistently get the same error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [LTx] of <class 'str'>
I have been trying some variations of this, such as the one below, but unfortunately none have led to a working solution:
df['LTx3'] = (df.iloc[:, df.columns.get_loc('LTx'):df.columns.get_loc('LTx2')]).sum(axis=1)
Basically, columns LTx and LTx2 are made up of integers that vary row to row. I want to use these integers as the bounds for the slice, but I'm not quite certain how to do this.
If anyone could help lead me to a solution it would be fantastic!
Cheers

I'd recommend reading up on .loc, .iloc slicing in pandas:
https://pandas.pydata.org/pandas-docs/stable/indexing.html
.loc selects based on name(s). .iloc selects based on index (numerical) position.
You can also subset based on column names. Note also that depending on how you create your dataframe, you may have numbers cast as strings.
To get the row corresponding to 223:
df3 = df[df['Col'] == '223']
To get the columns corresponding to the names 33, 34, and 35:
df3 = df[df['Col'] == '223'].loc[:, '33':'35']
If you want to select rows wherein any column contains a given string, I found this solution: Most concise way to select rows where any column contains a string in Pandas dataframe?
df[df.apply(lambda row: row.astype(str).str.contains('LTx2').any(), axis=1)]
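Since the actual goal in the question is to sum a per-row slice whose bounds come from the integer columns LTx and LTx2, a minimal sketch (assuming the frame matches the layout above and LTx/LTx2 hold valid integer positions) uses a row-wise apply:
# sum each row's slice between the positions stored in its own LTx and LTx2 columns
df['LTx3'] = df.apply(
    lambda row: row.iloc[int(row['LTx']):int(row['LTx2'])].sum(),
    axis=1,
)
This is not vectorized, so it will be slow on very large frames, but it expresses the per-row bounds directly.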

Related

Can I "flatten" a column with empty cells?

For example, I have a dataframe below with multiple columns and rows in which the last column only has data for some of the rows. How can I take that last column and write it to a new dataframe while removing the empty cells that would remain if I just copied the entire column?
Part Number  Count  Miles
2345125      14     543
5432545      12
6543654      6      112
6754356      22
5643545      6
7657656      8      23
7654567      11     231
3455434      34     112
The data frame I want to obtain is below:
Miles
543
112
23
231
112
I've tried converting the empty cells to NaN and then removing, but I always either get a key error or fail to remove the rows I want. Thanks for any help.
# copy the column
series = df['Miles']
# drop nan values
series = series.dropna()
# one-liner
series = df['Miles'].dropna()
Do you mean:
df.loc[df.Miles.notna(), 'Miles']
Or if you want to drop the rows:
df = df[df.Miles.notna()]
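Putting it together, a short sketch (assuming the Miles values may have come in as empty strings rather than NaN) that yields the single-column frame shown above:
import pandas as pd
# coerce anything non-numeric (including empty strings) to NaN, then drop those rows
miles = pd.to_numeric(df['Miles'], errors='coerce').dropna().reset_index(drop=True).to_frame('Miles')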

Merge levels for categorical levels in pandas

I am wondering how to merge the levels of a categorical variable in Python.
I have the following dataset:
dataset['Reason'].value_counts().head(5)
Reason Count
0 339
7 125
11 124
3 82
0 65
Now, I want to merge the first and last rows (both showing Reason 0), so that the output looks like:
dataset['Reason'].value_counts().head(5)
Reason Count
0 404
7 125
11 124
3 82
2 52
In order to get the reason, I had to split a string, which might have led to the various levels in the Reason column.
I have tried to use the loc function, but I am wondering whether there is a smarter way to do it:
dataset.loc[dataset['Reason'] == '0' , ['Reason']] = 'On request'
dataset.loc[dataset['Reason'] == '0 ' , ['Reason']] = 'On request'
Thanks, Michael.
As @anky_91 mentioned, use Series.str.strip if all values are strings:
dataset['Reason'].str.strip().value_counts().head(5)
If some values are numeric, first cast to strings with Series.astype:
dataset['Reason'].astype(str).str.strip().value_counts().head(5)
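If the goal after cleaning is also to rename a code to a label, as the loc lines in the question attempt, a hedged sketch is to strip once and then map with replace:
# normalize the strings once, then map the code '0' to a readable label
dataset['Reason'] = dataset['Reason'].astype(str).str.strip()
dataset['Reason'] = dataset['Reason'].replace({'0': 'On request'})
dataset['Reason'].value_counts().head(5)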

Iterating through DataFrame columns and deleting a row if a cell value is not a number

I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over each column, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the '401,4,120,nan,340' row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. If so, then you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
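End to end, a small sketch (the filename bills.csv is hypothetical) that reads the CSV above and drops the rows whose internetbill is not a number:
import pandas as pd

# read the CSV, coerce non-numeric internetbill values to NaN, then drop those rows
df = pd.read_csv('bills.csv')
df['internetbill'] = pd.to_numeric(df['internetbill'], errors='coerce')
df = df.dropna(subset=['internetbill'])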

Pandas: Join or merge multiple dataframes on a column where column values are repeating

I have three dataframes, each with more than 71K rows. Below are samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output:
df_final
Device_ID  Col_A  Col_B  Col_C
1001       45     35     5
1034       56     46     14
1223       78     78     8
1001       33     33     13
1887       NaN    66     16
1223       NaN    NaN    8
When using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID'), it takes a very long time. One reason is the repeating values of Device_ID.
I am aware of the reduce method, but I suspect it may lead to the same situation.
Is there any better and efficient way?
To get your desired outcome, you can use this:
result = pd.concat(
    [df_1.drop('Device_ID', axis=1), df_2.drop('Device_ID', axis=1), df_3],
    axis=1,
).set_index('Device_ID')
If you don't want to use Device_ID as index, you can remove the set_index part of the code. Also, note that because of the presence of NaN's in some columns (Col_A and Col_B) in the final dataframe, Pandas will cast non-missing values to floats, as NaN can't be stored in an integer array (unless you have Pandas version 0.24, in which case you can read more about it here).
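For comparison, the reduce-based approach the question alludes to might look like the sketch below; with repeated Device_ID values, each merge pairs every matching row on both sides (a partial cartesian product), which is what makes it slow on 71K-row frames:
from functools import reduce
import pandas as pd

# outer merge keeps Device_IDs that are missing from some of the frames
dfs = [df_1, df_2, df_3]
merged = reduce(lambda left, right: pd.merge(left, right, on='Device_ID', how='outer'), dfs)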

Get substring from pandas dataframe while filtering

Say I have a dataframe with the following information:
Name     Points  String
John     24      FTS8500001A
Richard  35      FTS6700001B
John     29      FTS2500001A
Richard  35      FTS3800001B
John     34      FTS4500001A
Here is a way to build a DataFrame with the sample above:
import pandas as pd
keys = ('Name', 'Points', 'String')
names = pd.Series(('John', 'Richard', 'John', 'Richard', 'John'))
ages = pd.Series((24,35,29,35,34))
strings = pd.Series(('FTS8500001A','FTS6700001B','FTS2500001A','FTS3800001B','FTS4500001A'))
df = pd.concat((names, ages, strings), axis=1, keys=keys)
I want to select every row that meets the following criteria: Name == 'Richard' and Points == 35. For such rows I want to read the 4th and 5th characters of the String column (the two digits just after FTS).
The output I want is the numbers 67 and 38.
I’ve tried several ways to achieve it but with zero results. Can you please help?
Thank you very much.
Eduardo
Use a boolean mask to filter your df and then call str and slice the string:
In [77]:
df.loc[(df['Name'] == 'Richard') & (df['Points']==35),'String'].str[3:5]
Out[77]:
1 67
3 38
Name: String, dtype: object
Pandas string methods
You can mask it based on your criteria and then use pandas string methods:
mask_richard = df.Name == 'Richard'
mask_points = df.Points == 35
df[mask_richard & mask_points].String.str[3:5]
1 67
3 38
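If actual integers (67 and 38) are wanted rather than strings, a small follow-up sketch casts the slice:
# the slice returns strings; cast to int if numbers are needed
df.loc[(df['Name'] == 'Richard') & (df['Points'] == 35), 'String'].str[3:5].astype(int)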
