Dataframe slicing with string values - python

I have a string dataframe that I would like to modify. I need to cut off each row of the dataframe at a given value, say A4, and replace the values after A4 with -- or remove them. I would like to create a new dataframe that has values only up to the string "A4". How would I do this?
import pandas as pd
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','A5','A6'],['A1','A3','A2','A5','A4','A6'],['A1','A2','A4','A3','A6','A5'],['A2','A1','A3','A4','A5','A6'], ['A2','A1','A3','A4','A6','A5'],['A1','A2','A4','A3','A5','A6']]
input = pd.DataFrame(values, columns)
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','--','--'],['A1','A3','A2','A5','A4','--'],['A1','A2','A4','--','--','--'],['A2','A1','A3','A4','--','--'], ['A2','A1','A3','A4','--','--'],['A1','A2','A4','--','--','--']]
output = pd.DataFrame(values, columns)

You can make a small function, that will take an array, and modify the values after your desired value:
def myfunc(x, val):
    for i in range(len(x)):
        if x[i] == val:
            break
    x[(i+1):] = '--'
    return x
Then you need to apply the function to the dataframe in a row-wise (axis=1) manner:
input.apply(lambda x: myfunc(x, 'A4'), axis = 1)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 A5 A4 --
c3 A1 A2 A4 -- -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 -- -- --
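If you prefer to avoid an explicit Python loop, here is a vectorized sketch of the same idea. It reuses the input frame from the question and assumes the cutoff value appears at most once per row:
# Cells at or after the first 'A4' in each row, minus the 'A4' cells themselves,
# then overwrite those cells with '--'.
hit = input.eq('A4')
after_cutoff = hit.cumsum(axis=1).astype(bool) & ~hit
result = input.mask(after_cutoff, '--')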

Assuming the values you want to blank out are simply those greater than A4, you can use a regex replace:
df.replace('A([5-9])', '--', regex=True)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 -- A4 --
c3 A1 A2 A4 A3 -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 A3 -- --
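If the suffixes can go beyond a single digit (A10, A11, ...), the pattern above stops at A9; a wider pattern could look like this (a sketch, assuming every cell has the form A followed by a number):
# Blank out any value whose numeric suffix is 5 or higher, including
# two-digit suffixes such as A10.
df.replace(r'^A(?:[5-9]|[1-9]\d+)$', '--', regex=True)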

Related

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values if a value is found in other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C could be any other strings of choice. If values are present in multiple columns, then the value of F should reflect all of them, as in rows 3 and 4 of column F (row 3 has two values and row 4 has three). So far, I have tried the below:
DF.F=np.where(DF.A.isin(DF.B), DF.A,'in_B')
But it does not give the expected result. Any help?
STEPS:
Stack the dataframe.
Check for the duplicate values.
Unstack to get the same structure back.
Use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
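Here is a self-contained sketch of those steps, with the example frame reconstructed from the question (column and value names are taken from there):
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'a1', 'a1', 'a1'],
                   'C': ['c1', 'c2', 'a2', 'a2'],
                   'D': ['d1', 'd2', 'd3', 'a3'],
                   'E': ['e1', 'e2', 'e3', 'e4']})

# stack() walks the frame row by row, duplicated() flags every value already
# seen earlier, and unstack() restores the original shape as a boolean frame.
flags = df.stack().duplicated().unstack()

# dot() multiplies each boolean row with the column labels: True keeps the
# label, False drops it, and summing concatenates what is left.
df['new_col'] = flags.dot('In ' + df.columns + ',').str.strip(',')
print(df)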

Assign value to a column based on a string-based hierarchy

I am attempting to create a new column in a Pandas DataFrame where two columns are compared, and based on a pre-defined hierarchy a third column is populated after the comparison of the two columns. The new column will take the higher of the two based on the hierarchy. The hierarchy is as follows from highest to lowest:
A1
A2
A3
A4
A5
The DataFrame df is seen below.
sales_code price_bucket_a price_bucket_b
101 A1 A2
102 A3 A4
202 A2 A3
201 A4 A5
301 A2 A2
302 A5 A1
The desired output I am attempting to achieve is seen below.
sales_code price_bucket_a price_bucket_b price_bucket_hier
101 A1 A2 A1
102 A3 A4 A3
202 A2 A3 A2
201 A4 A5 A4
301 A2 A2 A2
302 A5 A1 A1
The hierarchy and DataFrame in question are just a snippet of the overall data.
Any assistance that anyone could provide would be greatly appreciated.
First we need to convert to an ordered categorical, then we can take the row-wise min to get the right answer:
cat = ['A1','A2','A3','A4','A5']
df[['price_bucket_a','price_bucket_b']].apply(lambda x: pd.Categorical(x, categories=cat, ordered=True)).min(axis=1)
0 A1
1 A3
2 A2
3 A4
4 A2
5 A1
dtype: object
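A version-agnostic alternative is to map each bucket to its position in the hierarchy, take the row-wise minimum, and map back; a sketch (the rank dictionary is an illustration, not part of the original answer):
# A1 -> 0, A2 -> 1, ..., A5 -> 4; the smallest rank wins.
rank = {v: i for i, v in enumerate(cat)}
best = df[['price_bucket_a', 'price_bucket_b']].apply(lambda s: s.map(rank)).min(axis=1)
df['price_bucket_hier'] = best.map(dict(enumerate(cat)))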
Here's one approach IIUC:
ix = df.filter(like='price').apply(lambda x: x.str.lstrip('A')).astype(int).idxmin(1)
df['price_bucket_hier'] = df.lookup(range(df.shape[0]), ix)
print(df)
sales_code price_bucket_a price_bucket_b price_bucket_hier
0 101 A1 A2 A1
1 102 A3 A4 A3
2 202 A2 A3 A2
3 201 A4 A5 A4
4 301 A2 A2 A2
5 302 A5 A1 A1
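Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a sketch of the same row-wise pick using plain numpy indexing (column names spelled out rather than selected with filter):
import numpy as np

buckets = df[['price_bucket_a', 'price_bucket_b']]
# Position of the best (lowest-numbered) bucket per row, then pick that value.
pos = buckets.apply(lambda x: x.str.lstrip('A')).astype(int).to_numpy().argmin(axis=1)
df['price_bucket_hier'] = buckets.to_numpy()[np.arange(len(df)), pos]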

Groupby and Sample pandas

I am trying to sample the resulting data after doing a groupby on multiple columns. If the respective group has more than 2 elements, I want to sample 2 records; otherwise take all of the records.
df:
col1 col2 col3 col4
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
target df:
col1 col2 col3 col4
A1 A2 A3 A4 or A5 or A6
A1 A2 A3 A4 or A5 or A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
I have mentioned A4 or A5 or A6 because, when we take a sample, any of the three might be returned.
This is what I have tried so far:
trial = pd.DataFrame(df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2)))
However, with this I do not get col1, col2 and col3.
I think you need a double reset_index - first to remove the third level of the MultiIndex and second to convert the MultiIndex to columns:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index(level=3, drop=True)
           .reset_index())
Or use reset_index followed by drop to remove the column level_3:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index()
           .drop('level_3', axis=1))
print(trial)
col1 col2 col3 col4
0 A1 A2 A3 A4
1 A1 A2 A3 A6
2 B1 B2 B3 B4
3 B1 B2 B3 B5
4 C1 C2 C3 C4
There is no need to convert this to a pandas DataFrame; it is one by default:
trial=df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2))
And this should add back col1, col2 and col3:
trial.reset_index(inplace=True,drop=False)
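An alternative sketch uses group_keys=False so that col1, col2 and col3 stay as ordinary columns and no reset_index is needed; min() caps the sample size for groups with fewer than 2 rows:
trial = (df.groupby(['col1', 'col2', 'col3'], group_keys=False)
           .apply(lambda g: g.sample(min(len(g), 2))))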

Taking last characters of a column of objects and making it the column on a dataframe - pandas python

I have a dataframe like the following:
df =
A B D
a1 b1 9052091001A
a2 b2 95993854906
a3 b3 93492480190
a4 b4 93240941993
What I want:
df_resp =
A B D
a1 b1 001A
a2 b2 4906
a3 b3 0190
a4 b4 1993
What I tried:
for i in (0,len(df['D'])):
    df['D'][i] = df['D'][i][-4:]
Error I got:
KeyError: 4906
Also, it takes a really long time and I think there should be a quicker way with pandas.
Use the pd.Series.str string accessor for vectorized string operations; these are preferred over using apply.
If the D elements are already strings:
df.assign(D=df.D.str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
If not
df.assign(D=df.D.astype(str).str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
You can change the column in place with:
df['D'] = df.D.str[-4:]
Use the apply() method of pandas.Series; it will be way faster than iterating with a for loop.
This should work (provided the column contains only strings):
df_resp = df.copy()
df_resp['D'] = df_resp['D'].apply(lambda x : x[-4:])
As for the KeyError, it probably comes from your DataFrame's index: calling df['D'][i] is label-based, i.e. i refers to the index's label, not its position. It would (probably) work if you replaced it with df['D'].iloc[i], which refers to the value at position i.
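A tiny illustration of that difference (the index values here are made up):
import pandas as pd

# With a non-default index, s[0] looks for the *label* 0 and raises a KeyError,
# while s.iloc[0] returns the first element by position.
s = pd.Series(['9052091001A', '95993854906'], index=[10, 20])
print(s.iloc[0][-4:])   # '001A'
# print(s[0])           # would raise KeyError: 0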
I hope this helps!

SQLite Python printing in rows?

Afternoon, I am trying to retrieve seat numbers from a database using the following code:
cur.execute("SELECT * FROM seats")
while True:
    row = cur.fetchone()
    if row == None:
        break
    print row[0]
But when I do so, it prints out each individual record one per line, like so:
A1
A2
B3 etc..
But I want each row to print out grouped by the same letter, if that makes sense, such as:
A1 A2 A3 A4 A5 A6 A7 A8
B1 B2 B3 B4 B5 B6 B7
But I can't seem to get it like that. How would I go about doing this?
Use the itertools.groupby() tool:
from itertools import groupby
for letter, rows in groupby(cur, key=lambda r: r[0][0]):
    print ' '.join([r[0] for r in rows])
The groupby() function loops over each row in cur, takes the first letter of the first column, and gives you (letter, rows) tuples. The rows value is another iterable; you can loop over that (with a for loop, for example) to list all rows that have that first letter.
This does rely on the rows being sorted already. If your rows alternate between first letters:
A1
A2
B1
B2
A3
A4
it'll print those as separate groups:
A1 A2
B1 B2
A3 A4
You may want to add an ORDER BY firstcolumnname clause to your query to ensure correct grouping.
This is what I see when I create a test db:
>>> cur.execute("SELECT * FROM seats ORDER BY code")
<sqlite3.Cursor object at 0x10b1a8730>
>>> for letter, rows in groupby(cur, key=lambda r: r[0][0]):
... print ' '.join([r[0] for r in rows])
...
A1 A2 A3 A4 A5 A6 A7 A8
B1 B2 B3 B4 B5 B6 B7 B8
C1 C2 C3 C4 C5 C6 C7 C8
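On Python 3 only the print syntax changes; a sketch of the same grouping, reusing the column name code from the test query above:
from itertools import groupby

cur.execute("SELECT code FROM seats ORDER BY code")
for letter, rows in groupby(cur, key=lambda r: r[0][0]):
    print(' '.join(r[0] for r in rows))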
