dividing a string into separate columns pandas python

I have seen similar questions, but nothing that really matches my problem. If I have a table of values such as:
value
a
b
b
c
I want to use pandas to add columns to the table to show, for example:
value a b
a 1 0
b 0 1
c 0 0
I have tried the following:
df['a'] = 0
def string_count(indicator):
    if indicator == 'a':
        df['a'] == 1
df['a'].apply(string_count)
But this produces:
0 None
1 None
2 None
3 None
I would like to at least get to the point where the choices are hardcoded in (i.e. I already know that a, b and c appear), but it would be even better if I could look at the set of strings in the column and then insert a column for each unique string.
Am I approaching this the wrong way?

dummies = pd.get_dummies(df.value)
a b c
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
If you only want to display unique occurrences, you can add:
dummies.index = df.value
dummies.drop_duplicates()
a b c
value
a 1 0 0
b 0 1 0
c 0 0 1
Alternatively:
df = df.join(pd.get_dummies(df.value))
value a b c
0 a 1 0 0
1 b 0 1 0
2 b 0 1 0
3 c 0 0 1
Here you could again call .drop_duplicates() to see only the unique entries from the value column.
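For reference, the join approach above can be sketched end to end as follows. This is a minimal sketch, assuming the column is named value as in the question; the dtype=int argument is an addition here, because pandas 2.0 and later return boolean columns from get_dummies by default, and the result name is only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"value": ["a", "b", "b", "c"]})

# One indicator column per unique string; dtype=int keeps 0/1 instead of True/False.
dummies = pd.get_dummies(df["value"], dtype=int)

# Attach the indicator columns and keep one row per unique value.
result = df.join(dummies).drop_duplicates().reset_index(drop=True)
print(result)
```

If only a hardcoded choice is needed, a single indicator column can also be built directly with (df["value"] == "a").astype(int).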

Related

Re-order Columns In A Data Frame Depending On Conditions Of Values

a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
So, reading down each column, every value starts at 0 in the first row; once a value changes it can never change back, and it can become either 1 or -1. I would like to rearrange the DataFrame columns into this order:
Columns that hit 1, earliest row first
Columns that hit -1, earliest row first
Finally, the remaining columns that never changed and stayed at zero (if there are any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main DataFrame has 3000 rows and 61 columns. Is there any way of doing this quickly?

We have to handle the positive and negative values separately. One way is to take the sum of each column and then use sort_values to adjust the ordering:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0

Try with pd.Series.first_valid_index
s = df.where(df.ne(0))
s1 = s.apply(pd.Series.first_valid_index)
s2 = s.bfill().iloc[0]
out = df.loc[:,pd.concat([s2,s1],axis=1,keys=[0,1]).sort_values([0,1],ascending=[False,True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
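The ordering rule from the question can also be written out explicitly as a sort key. This is only an illustrative sketch, not either answerer's method: the helper name sort_key is hypothetical, and it assumes, as the question states, that a column never changes again after its first non-zero value.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 0, 0], [0, -1, 1, 0], [1, -1, 1, 0], [1, -1, 1, 0]],
    columns=["A", "B", "C", "D"],
)

def sort_key(col):
    """Rank a column by (sign group, first row where it becomes non-zero)."""
    nonzero = df[col][df[col] != 0]
    if nonzero.empty:
        return (2, 0)                      # never changed: goes last
    sign_rank = 0 if nonzero.iloc[0] > 0 else 1   # columns hitting 1 before -1
    return (sign_rank, nonzero.index[0])   # earlier row first within a group

ordered = df[sorted(df.columns, key=sort_key)]
print(ordered)
```

A Python-level loop over 61 columns is cheap, so this should stay fast for a frame of the size mentioned in the question.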

Frequency distribution of list against another column

I am trying to get the frequency distribution of a column, which holds a list of words, against the class labels.
Label Numbers
0 [(a,b,c)]
0 [(d)]
0 [(e,f,g)]
1 [(a,z)]
1 [(d,x,y)]
The output should be:
0 1
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1

The list of sets in the 'Numbers' column makes manipulating the DataFrame as-is very difficult (this is not tidy data). The solution is to expand out the DataFrame so that you only have one number in the 'Numbers' column corresponding to one value in the 'Label' column. Assuming your data is in a DataFrame called df, the following code performs that operation:
rows_list = []
for index, row in df.iterrows():
    for element in row['Numbers'][0]:
        dict1 = {}
        dict1.update(key=row['Label'], value=element)
        rows_list.append(dict1)
new_df = pd.DataFrame(rows_list)
new_df.columns = ['Label', 'Numbers']
The result is
Label Numbers
0 0 a
1 0 b
2 0 c
3 0 d
4 0 e
5 0 f
6 0 g
7 1 a
8 1 z
9 1 d
10 1 x
11 1 y
Now it's a matter of pivoting:
print(new_df.pivot_table(index='Numbers', columns='Label', aggfunc=len,
                         fill_value=0))
The result is
Label 0 1
Numbers
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
See the first answer here for the last line of code.
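The same expand-then-count idea can be sketched with DataFrame.explode and pd.crosstab, which avoids the explicit row loop. This is a sketch under assumptions: each cell of Numbers is taken to be a list holding a single tuple, as the question's display suggests, and the names flat and counts are only illustrative.

```python
import pandas as pd

# Assumed shape of the data: each 'Numbers' cell is a list containing one tuple.
df = pd.DataFrame({
    "Label": [0, 0, 0, 1, 1],
    "Numbers": [
        [("a", "b", "c")],
        [("d",)],
        [("e", "f", "g")],
        [("a", "z")],
        [("d", "x", "y")],
    ],
})

# Unwrap the single tuple in each cell, then give every element its own row.
flat = (
    df.assign(Numbers=df["Numbers"].map(lambda cell: cell[0]))
      .explode("Numbers")
      .reset_index(drop=True)
)

# Count how often each element occurs under each label.
counts = pd.crosstab(flat["Numbers"], flat["Label"])
print(counts)
```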

Select rows which have only zeros in columns

I want to select the rows in a DataFrame which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4],[1,0,3,4],[0,0,0,0]],columns =['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!

You'll need to determine whether all columns of a row have zeros or not. Given a boolean mask, use DataFrame.all(axis=1) to do that.
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
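Put together as a self-contained script, the mask-based selection looks roughly like this. It is a minimal sketch using the question's data; the intermediate name mask is added here only to make the boolean step visible.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 6], [2, 4, 6, 8], [0, 0, 3, 4], [1, 0, 3, 4], [0, 0, 0, 0]],
    columns=["a", "b", "c", "d"],
)
mylist = ["a", "b"]

# True for rows where every column named in mylist equals zero.
mask = df[mylist].eq(0).all(axis=1)

selection = df[mask]
print(selection)
```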

Using reduce and Numpy's logical_and
The point of this is to eliminate the need to create new Pandas objects and simply produce the mask we are looking for using the data where it sits.
from functools import reduce
import numpy as np
df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0

Pandas DataFrame Groupby to get Unique row condition and identify with increasing value up to Number of Groups

I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL DataFrame. This is definitely not the way to do it; I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all unique values.
Thank you.

I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge

# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Compare each key with the previous row's key and keep a running total of changes.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
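As a further alternative to the two answers above, pandas can number the groups directly with GroupBy.ngroup. This is a hedged sketch rather than either answerer's method; sort=False is used on the assumption that IDs should follow the order in which each (A, B, C) combination first appears.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 1, 0, 1],
     [0, 1, 2, 0, 1],
     [0, 1, 1, 1, 1],
     [0, 1, 3, 0, 1],
     [0, 1, 2, 1, 0],
     [0, 1, 1, 2, 1]],
    columns=["A", "B", "C", "D", "E"],
)

# Number each distinct (A, B, C) combination from 0 upwards,
# in order of first appearance because sort=False.
df["ID"] = df.groupby(["A", "B", "C"], sort=False).ngroup()
print(df)
```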

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.

Going by your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> # integers are used here; with the question's string values, + would concatenate ('1' + '0' == '10')
>>> df = pd.DataFrame({'a':[1,1,1,1,1],'b':[0,0,1,0,3]})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But this does not look at the 'previous value of that same column'.
If you try to write df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because column 'c' does not exist yet while the expression is being evaluated.
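Because each value of the new column depends on the previous value of that same column, one straightforward option is an explicit loop that carries the previous result forward. The sketch below mirrors the Excel formula =IF(A70=A69,MIN((P70+Q69),1),P70) from the question; it assumes integer values rather than the strings in the question's DataFrame, and that the first row simply takes the value of b.

```python
import pandas as pd

# Integer version of the question's data (an assumption; the question uses strings).
data = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": [0, 0, 1, 0, 0]})

c = []
prev_a, prev_c = None, 0
for a, b in zip(data["a"], data["b"]):
    # Same 'a' as the previous row: accumulate, capped at 1. Otherwise restart from b.
    current = min(b + prev_c, 1) if a == prev_a else b
    c.append(current)
    prev_a, prev_c = a, current

data["c"] = c
print(data)
```

For the sample data this reproduces the c column the question asks for (0, 0, 1, 1, 1).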
