I am iterating through a dataframe, pulling out specific lines, and then enriching those lines with some other elements. I have a dictionary with the following mapping:
testdir = {0: 'zero', 40: 'forty', 60: 'sixty', 80: 'eighty'}
When I pull out a specific line from the original dataframe, which looks like this:
a b c x str
0 0 0 0 100.0 aaaa
I want the str cell to be set to the string mapped from the value in column c (here 0), so the output should be:
a b c x str
0 0 0 0 100.0 zero
Then, after some other conditions are met, a new line is pulled out from the original dataframe, and the output should be:
a b c x str
0 0 0 0 100.0 zero
3 4 30 60 100.0 sixty
I tried to use the map() method, something like:
df['str'][-1] = df['c'][-1].map(testdir)
but I'm erroring all over the place!
map is intended for a pd.Series, so if you can, just map the entire column once it is fully populated; that way you avoid the overhead of multiple calls to map:
df['str'] = df.c.map(testdir)
print(df)
a b c x str
0 0 0 0 100.0 zero
3 4 30 60 100.0 sixty
Note that to correctly index a single cell of the dataframe and map it through the dictionary, you need something like:
testdir[df.iat[-1, 2]]
Chained indexing such as df['c'][-1] is discouraged in the docs and has a negative effect when assigning: since it returns a copy, you are not updating the original dataframe.
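For completeness, a minimal runnable sketch of updating a single cell this way, using iat with positional lookups (the column layout is assumed from the example above):

```python
import pandas as pd

testdir = {0: 'zero', 40: 'forty', 60: 'sixty', 80: 'eighty'}
df = pd.DataFrame({'a': [0, 4], 'b': [0, 30], 'c': [0, 60],
                   'x': [100.0, 100.0], 'str': [None, None]})

# Look up column 'c' of the last row in the dictionary, then write
# the result into the 'str' cell of that same row.
col_c = df.columns.get_loc('c')
col_str = df.columns.get_loc('str')
df.iat[-1, col_str] = testdir[df.iat[-1, col_c]]
print(df['str'].iloc[-1])  # sixty
```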
I have a multi-index dataframe and want to set a slice of one of its columns equal to a series, ordered according to the match between the column slice's index and the series' index. The column's innermost index and the series' index are identical except for their ordering (see the example below).
I can do this by first sorting the series' index according to the column's index and then using series.values (see below), but this feels like a workaround, and I was wondering if it's possible to assign the series to the column slice directly.
example:
import pandas as pd
multi_index=pd.MultiIndex.from_product([['a','b'],['x','y']])
df=pd.DataFrame(0,multi_index,['p','q'])
s1=pd.Series([1,2],['y','x'])
df.loc['a','p']=s1[df.loc['a','p'].index].values
The code above gives the desired output, but I was wondering if the last line could be done more simply, e.g.:
df.loc['a','p']=s1
but this sets the column slice to NaNs.
Desired output:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
obtained output from df.loc['a','p']=s1:
p q
a x NaN 0
y NaN 0
b x 0.0 0
y 0.0 0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like this?
df.loc['a']['p'] = s1
The resulting df is:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
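If chained assignment like df.loc['a']['p'] = s1 raises a SettingWithCopyWarning or has no effect (it can, under copy-on-write in newer pandas), a sketch of an explicit alignment instead:

```python
import pandas as pd

multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])

# Reorder s1 to match the inner index of the 'a' slice,
# then assign the raw values so no re-alignment happens.
df.loc['a', 'p'] = s1.reindex(df.loc['a', 'p'].index).values
print(df)
```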
For example, the Gender attribute will be transformed into two attributes, "Genre=M" and "Genre=F".
I need two columns, Male and Female, with binary values indicating the presence or absence of the attribute.
Method 1: You can make use of pd.get_dummies(colname), which will give you n new columns (where n is the number of distinct values of that column), each a binary flag representing the value state for each row.
Method 2:
We can also use df.Colname.map({'M': 0, 'F': 1})
Method 3:
We can use the replace command, like df.Colname.replace(['M', 'F'], [1, 0], inplace=True)
The first method is one-hot encoding; the other two are similar to label encoding.
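A runnable sketch of the three methods side by side (the Gender column and the 0/1 coding are assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M']})

# Method 1: one-hot encoding, one binary column per distinct value
dummies = pd.get_dummies(df['Gender'], prefix='Gender').astype(int)

# Method 2: label encoding via map
df['Gender_mapped'] = df['Gender'].map({'M': 0, 'F': 1})

# Method 3: label encoding via replace (done out-of-place here)
df['Gender_replaced'] = df['Gender'].replace(['M', 'F'], [1, 0])

print(dummies)
print(df)
```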
Use the pandas function get_dummies.
get_dummies: Convert categorical variable into dummy/indicator variables. Source.
Example of usage:
s = pd.Series(list('abca'))
pd.get_dummies(s)
Output:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Solved Below
Issue: Cannot sort the .groupby() result because a single value is a string-type object: in the data below, the value 10 at Index 5 in ColA is a string. pd.to_numeric() sorts the column properly if sorting by that column alone.
Question: Can a single value in ColA be converted?
Method:
ind = pd.to_numeric(df['ColA'], errors='coerce').fillna(999).astype(int).argsort()
df = df.reindex(ind)
df = df.groupby(df.ColA).apply(pd.DataFrame.sort_values, 'ColB')
df = df.reset_index(drop=True)
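A shorter sketch of the same idea, using a temporary numeric sort key so the non-numeric value sorts last (sample data reproduced from the question; ColC omitted for brevity):

```python
import pandas as pd

# Sample data from the question (ColC omitted for brevity)
df = pd.DataFrame({
    'ColA': ['2', '4', '2', '3', '2', '10', '10', '2', 'Load'],
    'ColB': ['14-5', '18-2', '24-5', '23-8', '13-6', '86-1', '42-1', '35-6', None],
})

# Temporary numeric key: 'Load' becomes NaN and sorts last
df['_key'] = pd.to_numeric(df['ColA'], errors='coerce')
df = (df.sort_values(['_key', 'ColB'])
        .drop(columns='_key')
        .reset_index(drop=True))
print(df)
```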
Data in:
Index ColA ColB ColC
0 2 14-5 MumboJumbo
1 4 18-2 MumboJumbo2
2 2 24-5 MumboJumbo3
3 3 23-8 MumboJumbo4
4 2 13-6 MumboJumbo5
5 10 86-1 MumboJumbo6
6 10 42-1 MumboJumbo7
7 2 35-6 MumboJumbo8
8 Load NaN MumboJumbo9
Desired Output:
Index ColA ColB ColC
0 2 13-6 MumboJumbo5
1 2 14-5 MumboJumbo
2 2 24-5 MumboJumbo3
3 2 35-6 MumboJumbo8
4 3 23-8 MumboJumbo4
5 4 18-2 MumboJumbo2
6 10 42-1 MumboJumbo7
7 10 86-1 MumboJumbo6
8 Load NaN MumboJumbo9
Thanks!
I don't really understand the problem in the question but you can select specific values in a DataFrame using iloc (positional index) or loc (label index). Since you are asking to replace the value in the fifth row in the first column in your dataset, we use iloc.
df.iloc[from_row:to_row,column_position]
To convert the value '10' in ColA at row 5 to an int, you simply select it and then update it.
df.iloc[5:6,0] = 10
If you don't know the location of the value you need to convert, then iloc and loc are no help.
There are several ways to convert all values in a column to a specific dtype. One way would be using a lambda-function.
df[column_name].apply(lambda x: int(x))
The lambda above will break because your data also contains the string Load, and you can't convert that to an int. One way to solve this is to add conditions to your lambda.
df[column_name].apply(lambda x: int(x) if something else something)
Given the data in your question the most straightforward way would be to check if x is not 'Load':
df[column_name].apply(lambda x: int(x) if x != 'Load' else x)
This becomes a hassle if you have lots of actual strings in your column. If you want to use a lambda, you could make a list of the actual strings and then check whether x is in that list.
list_of_strings = ['Load', 'Road', 'Toad']
df[column_name].apply(lambda x: int(x) if x not in list_of_strings else x)
Another way would be to write a separate function that manages the conversion using try/except blocks.
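A sketch of that helper-function approach (the function name is illustrative):

```python
import pandas as pd

def to_int_if_possible(x):
    """Try to convert x to int; return it unchanged on failure."""
    try:
        return int(x)
    except (ValueError, TypeError):
        return x

df = pd.DataFrame({'ColA': ['2', '10', 'Load']})
df['ColA'] = df['ColA'].apply(to_int_if_possible)
print(df['ColA'].tolist())  # [2, 10, 'Load']
```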
I am learning python and pandas and am having trouble overcoming an error while trying to subset a data frame.
I have an input data frame:
df0-
Index Group Value
1 A 10
2 A 15
3 B 20
4 C 10
5 C 10
df0.dtypes-
Group object
Value float64
I am trying to split it into separate dataframes based on the unique values of the Group column, with the output looking something like this:
df1-
Index Group Value
1 A 10
2 A 15
df2-
Index Group Value
3 B 20
df3-
Index Group Value
4 C 10
5 C 10
So far I have written this code to subset the input:
UniqueGroups = df0['Group'].unique().tolist()
OutputFrame = {}
for x in UniqueAgencies:
ReturnFrame[str('ConsolidateReport_')+x] = UniqueAgencies[df0['Group']==x]
The code above returns the following error, which I can't quite wrap my head around. Can anyone point me in the right direction?
*** TypeError: list indices must be integers or slices, not str
You can use groupby to group on the column:
for _, g in df0.groupby('Group'):
    print(g)
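If you want the named sub-frames from the question, a sketch building them with a dict comprehension over the same groupby (the key prefix follows the question's naming):

```python
import pandas as pd

df0 = pd.DataFrame({'Group': ['A', 'A', 'B', 'C', 'C'],
                    'Value': [10.0, 15.0, 20.0, 10.0, 10.0]})

# One sub-DataFrame per unique Group value, keyed by name
OutputFrame = {'ConsolidateReport_' + name: group
               for name, group in df0.groupby('Group')}

print(sorted(OutputFrame))
```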
I have a pandas groupby object that I made from a larger dataframe, in which amounts are grouped under a person ID variable as well as whether each was an ingoing or outgoing transaction. Here's an example:
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8
(sorry I don't know how to put actual sample data in). Note that some folks can have one or the other (e.g., maybe they have some going out but nothing coming in).
All I want to do is get the difference in the amounts, collapsed under the person. So the ideal output would be, perhaps, a dictionary or another dataframe containing the difference in amounts for each person, like this:
ID Difference
1 -3
2 2
3 -6
4 -8
I have tried a handful of different ways to do this but am not sure how to work with these nested lists in python.
Thanks!
We could select the rows that are Out, convert their amounts to negative integers, and then use sum().
import io
import pandas as pd
s = '''\
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8'''
# Recreate the dataframe from the string
df = pd.read_csv(io.StringIO(s), sep=r'\s+')
# Select rows where In_Out == 'Out' and multiply by -1
df.loc[df['In_Out'] == 'Out', 'Amount'] *= -1
# Convert to dict
d = df.groupby('ID')['Amount'].sum().to_dict()
print(d)
Returns:
{1: -3, 2: 2, 3: -6, 4: -8}
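For comparison, a sketch that gets the same dictionary without mutating the Amount column, by building a signed series first:

```python
import io
import pandas as pd

s = '''ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8'''
df = pd.read_csv(io.StringIO(s), sep=r'\s+')

# Signed amount: In counts positive, Out counts negative
signed = df['Amount'].where(df['In_Out'] == 'In', -df['Amount'])
d = signed.groupby(df['ID']).sum().to_dict()
print(d)  # {1: -3, 2: 2, 3: -6, 4: -8}
```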