How do you set a specific column with a specific value to a new value in a Pandas DF? - python

I imported a CSV file that has two columns, ID and bee_type. The bee_type column holds two values, bumble_bee and honey_bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumble_bee it says 1.
However, my code is setting everything to 1. How can I keep the ID column at its original values and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")

# set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1

I believe you need to map by dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
Your loop-free idea is right, but labels[labels['bee_type'] == 'bumble_bee'] = 1 assigns 1 to every column of the matching rows, which is why the ID column gets overwritten too. Remove the loop and use loc with the target column selected explicitly:
labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_type'] = 1
print (labels)
   ID   bee_type
0   0          1
1   1  honey_bee
2   2          1
3   3  honey_bee
4   4          1
Sample:
labels = pd.DataFrame({
    'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
    'ID': list(range(5))
})
print (labels)
   ID    bee_type
0   0  bumble_bee
1   1   honey_bee
2   2  bumble_bee
3   3   honey_bee
4   4  bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
   ID  bee_type
0   0         1
1   1         2
2   2         1
3   3         2
4   4         1

As far as I can understand, you want to convert names to numbers. If that's the scenario, please try LabelEncoder; detailed documentation can be found under sklearn LabelEncoder.
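For example, a minimal sketch assuming the same labels DataFrame as above (note that LabelEncoder assigns codes starting from 0, so here you would get 0/1 rather than 1/2):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels['bee_type'] = le.fit_transform(labels['bee_type'])
# le.classes_ keeps the original names, so the mapping can be
# reversed later with le.inverse_transform(...)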

Related

Find "most used items" per "level" in big csv file with Pandas

I have a rather big csv file and I want to find out which items are used the most at a certain player level.
So one column I'm looking at has all the player levels (from 1 to 30), another column has all the item names (e.g. knife_1, knife_2, etc.), and yet another column lists backpacks (backpack_1, backpack_2, etc.).
Now I want to check which is the most used knife and backpack for player level 1, for player level 2, player level 3, etc.
What I've tried was this but when I tried to verify it in Excel (with countifs) the results were different:
import pandas as pd
df = pd.read_csv('filename.csv')
#getting the columns I need:
df = df[["playerLevel", "playerKnife", "playerBackpack"]]
print(df.loc[df["playerLevel"] == 1].mode())
In my head, this should locate all the rows with playerLevel 1 and then only print out the most used items for that level. However, I wanted to double-check and used "countifs" in Excel, which gave me a different result.
Maybe I'm thinking too simple (or complicated) so I hope you can either verify that my code should be correct or point out the error.
I'm also looking for an easy way to then go through all levels automatically and print out the most used items for each level.
Thanks in advance.
Edit:
Dataframe example. Just imagine there are thousands of players that can range from level 1 to level 30. And especially on higher levels, they have access to a lot of knives and backpacks. So the combinations are limitless.
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
Try the following:
data = """\
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
"""
import io
import pandas as pd

stream = io.StringIO(data)
df = pd.read_csv(stream, sep=r'\s+')
df = df.drop('index', axis='columns')
print(df.groupby('playerLevel').agg(pd.Series.mode))
yields
playerKnife playerBackpack
playerLevel
1 knife_1 backpack_1
2 [knife_2, knife_3] [backpack_1, backpack_2]
3 knife_1 backpack_2
13 knife_10 backpack_9
15 knife_13 backpack_12
Note that the result of df.groupby('playerLevel').agg(pd.Series.mode) is a DataFrame, so you can assign that result and use it as a normal dataframe.
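If you would rather have a single scalar per cell (ties like level 2 above produce lists), one small variant of the same groupby keeps only the first mode:
# keep only the first mode per group, so ties yield one value
stats = df.groupby('playerLevel').agg(lambda s: s.mode().iat[0])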
For data read straight from a CSV file, simply use
df = pd.read_csv('filename.csv')
df = df[['playerLevel', 'playerKnife', 'playerBackpack']]  # or whichever columns you want
stats = df.groupby('playerLevel').agg(pd.Series.mode)  # stats will be a dataframe as well
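Since stats is indexed by playerLevel, you can then look up or loop over the levels directly, e.g. (assuming level 1 exists in your data):
print(stats.loc[1])  # most used knife and backpack at level 1
for level, row in stats.iterrows():
    print(level, row['playerKnife'], row['playerBackpack'])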

Fastest way to limit number of decimals in Pandas DataFrame and add percentage sign [duplicate]

I have a pd.DataFrame of floats:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df
0 1 2 3 4
0 0.795329 0.125540 0.035918 0.645191 0.819097
1 0.755365 0.333681 0.466814 0.892638 0.610947
2 0.545204 0.313946 0.538049 0.237857 0.365573
3 0.026813 0.867013 0.843798 0.923359 0.464778
4 0.514968 0.201144 0.853341 0.951981 0.214948
I'm trying to format all of them as percentage:
0 1 2 3 4
0 "25.60%" "11.55%" "98.62%" "73.16%" "38.85%"
1 "26.01%" "28.57%" "65.21%" "32.55%" "93.45%"
2 "19.99%" "41.97%" "57.21%" "61.26%" "83.34%"
3 "41.54%" "71.02%" "52.93%" "42.78%" "49.77%"
4 "33.77%" "70.48%" "36.64%" "97.42%" "83.19%"
or
0 1 2 3 4
0 25.60% 11.55% 98.62% 73.16% 38.85%
1 26.01% 28.57% 65.21% 32.55% 93.45%
2 19.99% 41.97% 57.21% 61.26% 83.34%
3 41.54% 71.02% 52.93% 42.78% 49.77%
4 33.77% 70.48% 36.64% 97.42% 83.19%
Many solutions exist, but only for a single column, for example here. I'm trying to edit the values themselves, so I'm not interested in changing the float display format.
How can I proceed?
Try this (in pandas ≥ 2.1, applymap is deprecated in favour of the equivalent DataFrame.map):
df = df.applymap('{:,.2%}'.format)
Or np.vectorize:
df[:] = np.vectorize('{:,.2%}'.format)(df)
If you want to display numeric data as percents, while keeping the underlying data numeric and not strings, then you can use dataframe styling:
https://stackoverflow.com/a/55624414
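For instance, a minimal sketch with the Styler (display-only; it renders in notebooks and HTML, not with a plain print):
df.style.format('{:.2%}')  # underlying floats stay numeric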

How to extract alphanumeric word from column values in excel with Python?

I need a way to extract all words that start with 'A' followed immediately by a 6-digit numeric string (e.g. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df's content is constructed from the following code:
import pandas as pd

df1 = pd.DataFrame({
    "columnA": ["A194533", "A4A556633 system01A484666", "A4A556633",
                "a987654A948323a882332A484666", "A238B004867",
                "pageA000023lol", "a089923",
                "something lol a484876A48466 emoji",
                "A906633 A556633a556633"]
})
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the targets matching the regex pattern. Since extractall scans the whole string rather than requiring word boundaries, it also picks up codes glued to surrounding text (like pageA000023lol), which covers the missing-space case:
result = df1['columnA'].str.extractall(r'(A\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Send the unique index into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
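Since result has a single column, a small variant is to count that column directly, which avoids the tuple unpacking above:
counts = result[0].value_counts()     # Series indexed by the code itself
unique_list = counts.index.tolist()   # ['A556633', 'A484666', ...]
unique_count_list = counts.tolist()   # [3, 2, 1, 1, 1, 1]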

Pandas error: "None of [Index([' '], dtype='object')] are in the [columns]"

For some reason, my code works when a list I am passing contains only integers. Using strings otherwise leads to the error in the title.
Here is my code:
def get_support(self, data, itemset):
    return data[itemset].all(axis='columns').sum()
    # I also tried: return data.loc[:, itemset].all(axis='columns').sum()
    # this function returns the number of True values (from .all()) of a given column or set of columns
A sample of a data where this code works is:
0 1 2
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Running get_support(df, [0]) returns 4 and running get_support(df, [0, 2]) returns 2.
However, once columns are labeled, the code no longer works and outputs the error. I've checked the .csv file, and it's completely clean, with no spaces or extra stuff.
Sample of a data that will cause an error in my code:
Red Yellow Blue
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Where exactly am I wrong?
Edit: Thank you very much to @osint_alex. The error is gone now, but unfortunately there is a newfound problem:
print(get_support(temp_df, ['A']))
print(get_support(temp_df, ['A', 'B']))
print(get_support(temp_df, ['A', 'B', 'C']))
Running this block of code outputs the same value for each line: 9835, which is the number of rows the dataset has.
I have attempted commenting out the other lines, and I still get 9835. However, after checking the .csv file, I should only get 516 for A (I'm unable to test the others).
As of now, I am still trying to solve it on my own, but the numbers are so far off that I do not even know where to begin.
I think what is going wrong is that you are using itemset to select the columns of interest in the dataframe. So if you want to use itemset to denote the indices of the columns - which is what I assume you want to do if you always pass in numbers - then the get_support function needs to change to reflect this.
Essentially if you pass in numbers and the columns are labelled with strings, you'll get a key error because those numbers aren't the column names.
Here is a suggested revision:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [3, 4, 0]})

def get_support(data, itemset):
    cols = [x for x in data.columns if list(data.columns).index(x) in itemset]
    return data[cols].all(axis='columns').sum()

print(get_support(df, [0, 1]))
Out:
2
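Alternatively, if itemset always holds integer column positions, iloc can do the positional lookup directly; a sketch under that assumption:
def get_support(data, itemset):
    # itemset contains column positions, e.g. [0, 1]
    return data.iloc[:, itemset].all(axis='columns').sum()

print(get_support(df, [0, 1]))  # 2, as above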
You're going wrong here:
data[itemset].all(axis = 'columns').sum()
You can't sum() a string. You could run it through a data cleaning function first to make sure the list only has integers or floats.

How to avoid a loop in a df when accessing previous rows

I use pandas to process transport data. I study the attendance of bus lines. I have two columns that count the people getting on and off the bus at each stop, and I want to create one that counts the people currently on board. At the moment, I use a loop through the df: for row n it does current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index-1, 'current']
Is there a way to avoid using a loop ?
Thanks for your time !
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
  Stop  On  Off  current  Total
0    A   3    2        1      1
1    B   2    1        1      2
2    C   3    0        3      5
3    D   2    1        1      6
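As a side note, if your frame ever mixes several bus lines, the same idea extends to a per-line running total; a sketch assuming a hypothetical 'line' column:
# running total per bus line ('line' is a hypothetical grouping column)
df['Total'] = (df['On'] - df['Off']).groupby(df['line']).cumsum()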
