Splitting multiple columns on a delimiter into rows in pandas dataframe [duplicate] - python

This question already has answers here:
pandas: records with lists to separate rows
(3 answers)
Closed 4 years ago.
I have a pandas dataframe as shown here:
id pos value sent
1 a/b/c test/test2/test3 21
2 d/a test/test5 21
I would like to split (=explode) df['pos'] and df['value'] so that the dataframe looks like this:
id pos value sent
1 a test 21
1 b test2 21
1 c test3 21
2 d test 21
2 a test5 21
It doesn't work if I split each column and then concat them à la
pos = df['pos'].str.split('/', expand=True).stack().str.strip().reset_index(level=1, drop=True)
value = df['value'].str.split('/', expand=True).stack().str.strip().reset_index(level=1, drop=True)
df1 = pd.concat([pos, value], axis=1, keys=['pos', 'value'])
Any ideas? I'd really appreciate it.
EDIT:
I tried using this solution here : https://stackoverflow.com/a/40449726/4219498
But I get the following error:
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
I suppose this is a numpy-related issue, although I'm not sure how it happens. I'm using Python 2.7.14.

I tend to avoid the stack magic in favour of building a new dataframe from scratch. This is usually also more efficient. Below is one way.
import numpy as np
import pandas as pd
from itertools import chain

# number of tokens per row, used to repeat the scalar columns
lens = list(map(len, df['pos'].str.split('/')))
res = pd.DataFrame({'id': np.repeat(df['id'], lens),
                    'pos': list(chain.from_iterable(df['pos'].str.split('/'))),
                    'value': list(chain.from_iterable(df['value'].str.split('/'))),
                    'sent': np.repeat(df['sent'], lens)})
print(res)
id pos sent value
0 1 a 21 test
0 1 b 21 test2
0 1 c 21 test3
1 2 d 21 test
1 2 a 21 test5
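For readers on newer pandas, a split-then-explode approach reaches the same result. This is a sketch that assumes recent version support (Series/DataFrame.explode exists from pandas 0.25, and exploding several columns at once requires 1.3+), so it does not apply to the Python 2.7 setup above:
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'pos': ['a/b/c', 'd/a'],
                   'value': ['test/test2/test3', 'test/test5'],
                   'sent': [21, 21]})

# split both delimited columns into lists, then explode them in lockstep
df[['pos', 'value']] = df[['pos', 'value']].apply(lambda s: s.str.split('/'))
print(df.explode(['pos', 'value'], ignore_index=True))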

Related

Filtering data-frame columns using regex, then using .groupby to calculate sum

I have a dataframe which I want to group, filter columns by regex, and then sum.
My code looks like this:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'Invasive': [12, 1, 1, 0, 1, 0],
                   'invasive': [1, 4, 5, 3, 4, 6],
                   'Wild': [4, 7, 1, 0, 0, 0],
                   'wild': [0, 0, 9, 8, 3, 2],
                   'Crop': [0, 0, 0, 0, 0, 0],
                   'Crop_2': [2, 3, 2, 2, 1, 2]})
df.groupby(['ID']).filter(regex='(Invasive)|(invasive)|(Wild)|(wild)').sum()
The error message I get is:
DataFrameGroupBy.filter() missing 1 required positional argument: 'func'
I get the same error message if groupby comes after filter.
Why does this happen? Where do I input the func argument?
EDIT:
My Expected output is one column that has summed across the filtered columns and is grouped by ID. E.g.:
ID Output
0 1 29
1 2 27
2 3 16
What you want to do doesn't work that way: GroupBy.filter filters out whole groups of rows and requires a function, and is not to be confused with DataFrame.filter, which filters labels.
You likely want to filter the columns, then aggregate:
df.filter(regex='(?i)(Invasive|Wild)').groupby(df['ID']).sum()
NB. I replaced (Invasive)|(invasive)|(Wild)|(wild) with (?i)(Invasive|Wild), which matches 'Invasive' OR 'Wild' independently of the case.
Output:
Invasive invasive Wild wild
ID
1 13 5 11 0
2 1 8 1 17
3 1 10 0 5
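For contrast, the GroupBy.filter that the error message refers to takes a function and keeps or drops entire groups of rows, which is not what is needed here. A minimal sketch of what it is actually for (the threshold is illustrative only):
# keep only the IDs whose total 'Invasive' count exceeds 1
df.groupby('ID').filter(lambda g: g['Invasive'].sum() > 1)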
EDIT:
The output that you show needs a further summation per row:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .groupby(df['ID']).sum()
         .sum(axis=1)
         .reset_index(name='Output')
       )
# or with the per-row summation first:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .sum(axis=1)
         .groupby(df['ID']).sum()
         .reset_index(name='Output')
       )
Output:
ID Output
0 1 29
1 2 27
2 3 16

String slice of a column in a dataframe [duplicate]

This question already has answers here:
Pandas make new column from string slice of another column
(3 answers)
Closed 4 months ago.
import pandas as pd

data = [['Tom', '5-123g'], ['Max', '6-745.0d'], ['Bob', '5-900.0e'], ['Ben', '2-345'], ['Eva', '9-712.x']]
df = pd.DataFrame(data, columns=['Person', 'Action'])
I want to shorten the "Action" column to a length of 5. My current df has two columns:
['Person'] and ['Action']
I need it to look like this:
Person Action Action_short
0 Tom 5-123g 5-123
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-900
3 Ben 2-345 2-345
4 Eva 9-712.x 9-712
What I've tried:
Checking the dtype of the column:
df['Action'].dtypes
The output is:
dtype('O')
Then I tried:
df['Action'] = df['Action'].map(str)
df['Action_short'] = df.Action.str.slice(start=0, stop=5)
I also tried it with:
df['Action'] = df['Action'].astype(str)
df['Action'] = df['Action'].values.astype(str)
df['Action'] = df['Action'].map(str)
df['Action'] = df['Action'].apply(str)
and with:
df['Action_short'] = df.Action.str.slice(0, 5)
df['Action_short'] = df.Action.apply(lambda x: x[:5])
df['pos'] = df['Action'].str.find('.')
df['new_var'] = df.apply(lambda x: x['Action'][0:x['pos']],axis=1)
The output from all my versions was:
Person Action Action_short
0 Tom 5-123g 5-12
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-90
3 Ben 2-345 2-34
4 Eva 9-712.x 9-712
The lambda function is not working with 3-222; it slices it to 3-22.
I don't get why it works for some values and not for others.
Try this:
df['Action_short'] = df['Action'].str.slice(0, 5)
By using .str on a DataFrame or a single column of a DataFrame (which is a pd.Series), you can access pandas string manipulation methods that are designed to look like the string operations on standard python strings.
# slice by specifying the length you need
df['Action_short'] = df['Action'].str[:5]
df
Person Action Action_short
0 Tom 5-123g 5-123
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-900
3 Ben 2-345 2-345
4 Eva 9-712.x 9-712
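One practical reason to prefer the .str accessor over apply with a plain lambda: it propagates missing values instead of raising. A small sketch (the NaN entry is added here purely for illustration):
import numpy as np
import pandas as pd

s = pd.Series(['5-123g', np.nan, '6-745.0d'])
print(s.str[:5])            # the NaN stays NaN: ['5-123', NaN, '6-745']
# s.apply(lambda x: x[:5])  # would raise TypeError on the NaN entry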

Regex Sub and Pandas

I am struggling to replace a string in a pandas cell with data from a dictionary. I have a pandas dataframe:
import pandas as pd
import numpy as np
import re
f = {'GAAP':['<1>','2','3','4'],'CP':['5','6','<7>','8']}
filter = pd.DataFrame(data=f)
filter
and a dictionary:
d = {'GAAP':['100','101'],'CP':['500','501','502']}
d
I am trying to get the following output:
op = {'GAAP':['100|101','2','3','4'],'CP':['5','6','500|501|502','8']}
op = pd.DataFrame(data=op)
op
I tried something like:
def rep1(fr, di):
    op = re.sub(r'\<.*?\>', fr, di)
    return op
a='|'.join(d['GAAP'])
op=rep1(filter['GAAP'],a)
op
but I get an error saying Series objects are mutable and cannot be hashed. Any suggestions on what I am doing wrong?
Let us try pd.to_numeric to convert the <..> cells to NaN, then fillna with the joined dictionary values:
filter = filter.apply(pd.to_numeric, errors='coerce').fillna(pd.Series(d).str.join('|'))
GAAP CP
0 100|101 5
1 2 6
2 3 500|501|502
3 4 8
One way about it using replace: build the regexes that match the <..> tokens and pair them with their replacements from the dictionary.
outcome = filter.replace({'GAAP': r"\<\d\>", 'CP': r"\<\d\>"},
                         {'GAAP': "|".join(d['GAAP']), 'CP': "|".join(d['CP'])},
                         regex=True)
GAAP CP
0 100|101 5
1 2 6
2 3 500|501|502
3 4 8
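For completeness, the rep1 attempt fails because re.sub expects strings for its pattern, replacement, and target arguments, not a whole Series. Applying it element-wise works; a minimal sketch over the question's own data (the loop and the out variable are mine):
import re
import pandas as pd

f = {'GAAP': ['<1>', '2', '3', '4'], 'CP': ['5', '6', '<7>', '8']}
filter = pd.DataFrame(data=f)
d = {'GAAP': ['100', '101'], 'CP': ['500', '501', '502']}

out = filter.copy()
for col in out.columns:
    repl = '|'.join(d[col])
    # replace each <...> token in a cell with the joined dictionary values
    out[col] = out[col].map(lambda cell: re.sub(r'<.*?>', repl, cell))
print(out)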

Pandas: Dataframe.Drop - ValueError: labels ['id'] not contained in axis

Attempting to drop a column from a DataFrame in Pandas. DataFrame created from a text file.
import pandas as pd
df = pd.read_csv('sample.txt')
df.drop(['id'], 1, inplace=True)
However, this generates the following error:
ValueError: labels ['id'] not contained in axis
Here is a copy of the sample.txt file :
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Thanks in advance.
So the issue is that your "sample.txt" file doesn't actually include the data you are trying to remove.
Your line
df.drop(['id'], 1, inplace=True)
is attempting to take your DataFrame (which includes the data from your sample file), find the column labelled 'id' in the header row (axis 1), and drop it in place (modifying the existing object rather than creating a new one without that column; this returns None).
The issue is that your sample data doesn't include a column with a header equal to 'id'.
In your current sample file, you can only do a drop where the label in axis 1 is 'a', 'b', 'c', 'd', or 'e'. Either correct your code to drop one of those values or get a sample file with the correct header.
The documentation for Pandas isn't fantastic, but here is a good example of how to do a column drop in Pandas: http://chrisalbon.com/python/pandas_dropping_column_and_rows.html
** Below added in response to a comment from @saar:
Here is my example code:
Sample.txt:
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Sample Code:
import pandas as pd
df = pd.read_csv('sample.txt')
print('Current DataFrame:')
print(df)
df.drop(['a'], 1, inplace=True)
print('\nModified DataFrame:')
print(df)
Output:
>>python panda_test.py
Current DataFrame:
a b c d e
0 1 2 3 4 5
1 2 3 4 5 6
2 3 4 5 6 7
3 4 5 6 7 8
Modified DataFrame:
b c d e
0 2 3 4 5
1 3 4 5 6
2 4 5 6 7
3 5 6 7 8
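On newer pandas (0.21 and later), the columns= keyword expresses the same operation without the positional axis argument, and errors='ignore' skips labels that are absent instead of raising. A sketch assuming the same sample.txt:
import pandas as pd

df = pd.read_csv('sample.txt')
df = df.drop(columns=['a'])                      # same as df.drop(['a'], axis=1)
# df = df.drop(columns=['id'], errors='ignore')  # no ValueError if 'id' is missing
print(df)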
bad = pd.read_csv('bad_modified.csv')
A = bad.sample(n=10)
B = bad.drop(A.index, axis=0)
This is an example of dropping part of a dataframe by row index, in case you need it.

Selecting max within partition for pandas dataframe [duplicate]

This question already has answers here:
Python pandas - filter rows after groupby
(4 answers)
Closed 8 years ago.
I have a pandas dataframe. My goal is to select only those rows where column C has the largest value within group B. For example, when B is "one" the maximum value of C is 311, so I would like the row where C = 311 and B = "one."
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': pd.Categorical(["test1", "test2", "test3", "test4"]),
                    'B': pd.Categorical(["one", "one", "two", "two"]),
                    'C': np.array([311, 42, 31, 41]),
                    'D': np.array([9, 8, 7, 6])})
df2.groupby('C').max()
Output should be:
A B C D
test1 one 311 9
test4 two 41 6
You can use idxmax(), which returns the indices of the max values:
maxes = df2.groupby('B')['C'].idxmax()
df2.loc[maxes]
Output:
A B C D
0 test1 one 311 9
3 test4 two 41 6
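Note that idxmax returns a single row per group, so ties are silently resolved to the first occurrence. A sort-based alternative over the same df2 (a sketch, not necessarily faster):
# sort by C, then keep the last (largest) row within each B
df2.sort_values('C').drop_duplicates('B', keep='last')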
