Parsing Pandas Series From Another Series - python

I am trying to parse a series of text using a series of numbers, as in the code below, but all I get in return is a series of NaNs.
import numpy as np
import pandas as pd
numData = np.array([4,6,4,3,6])
txtData = np.array(['bluebox','yellowbox','greybox','redbox','orangebox'])
n = pd.Series(numData)
t = pd.Series(txtData)
x = t.str[:n]
print (x)
The output is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
I would like the output to be
0 blue
1 yellow
2 grey
3 red
4 orange
Is there an easy way to do this?

t.str[:n] returns all NaNs because the slice bound passed to .str must be a scalar integer, not a Series. You can use a simple list comprehension if you really need per-row slice lengths and can't just chop off the last 3 characters. You will need error handling if your data aren't guaranteed to be all strings; an end exceeding the string length is harmless, though, since Python slices simply stop at the end (see the guarded sketch after the output below).
pd.Series([x[:end] for x, end in zip(t, n)], index=t.index)
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object
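
If the data may contain non-strings, a guarded version could look like the sketch below, reusing t and n from the question (the helper name safe_slice is mine, not a pandas API):
import numpy as np
import pandas as pd

def safe_slice(value, end):
    # Hypothetical helper: return NaN for non-string values
    # instead of raising a TypeError.
    if not isinstance(value, str):
        return np.nan
    # Slicing past the end of a string is harmless in Python;
    # it simply returns the whole string.
    return value[:end]

x = pd.Series([safe_slice(v, e) for v, e in zip(t, n)], index=t.index)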

You can use pd.Series.str.slice; since every string here ends in the same 3-character suffix ('box'), a constant negative stop works:
t.str.slice(stop=-3)
# shorthand for this is t.str[:-3]
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object
Or turn numData into an iterator using iter and use a slice object (slice(n) is equivalent to [:n]):
it = iter(numData)
t.map(lambda x: x[slice(next(it))])
0 blue
1 yellow
2 grey
3 red
4 orange
dtype: object

numdata_iter = iter(numData)
x = t.apply(lambda text: text[:next(numdata_iter)])
We turn numData into an iterator and invoke next on it for each slice inside apply. If numData is shorter than t, next will raise StopIteration, so the two should be the same length.

Related

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1, with the condition that in the resulting sample the values of column b are unique, giving me something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the code below with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group in column 'a', so nothing constrains the values in column b.
You can use df.groupby('a').first(), which deterministically picks the first row of each group; when each group's first b value is distinct, that gives the kind of output you are looking for.
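If first() doesn't happen to satisfy the uniqueness constraint for your data, one alternative is rejection sampling: keep re-drawing the grouped sample until column b has no duplicates. A minimal sketch, assuming a duplicate-free combination actually exists (otherwise the loop never terminates):
import pandas as pd

df = pd.DataFrame({'a': ['red', 'red', 'red', 'blue', 'blue', 'blue', 'black', 'black'],
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

# Re-draw one row per group until the b values are all distinct.
while True:
    sample = df.groupby('a').sample(n=1)
    if sample['b'].is_unique:
        break
print(sample)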

Creating a new column by summing previous ones with pandas error

Here is a picture of my file, the upper one is the original, the bottom one is what I get after I run my code: https://imgur.com/a/zUCGart
Here is my code:
import pandas as pd
path = r'C:\Users\myname\Downloads\RGB.xlsx'
df = pd.read_excel(path)
df['RGB'] = df.iloc[1:7:,:5:8].sum(axis=1)
df.to_excel(path)
Basically what I want to do is create a new column called RGB that sums the values of the red, blue and green columns. Hence the 1:7:,:5:8, meant to apply it to all of the rows and to the 5th, 6th and 7th (red, blue, green) columns, but instead it just made RGB equal to the black (first) column...
Not sure what I did wrong here.
My original dataframe:
Black Orange Yellow Brown Blue Green Red
7 4 3 1 6 7 2
3 3 3 8 4 5 2
6 7 3 2 2 2 5
2 9 2 2 2 2 2
5 5 5 5 5 5 5
2 2 8 2 27 8 2
You have some extra colons in your code, at df.iloc[1:7:,:5:8]. Try without them. Let me know if it works; otherwise I will come back with a more general solution.
import pandas as pd
path = r'C:\Users\myname\Downloads\RGB.xlsx'
df = pd.read_excel(path)
df['RGB'] = df.iloc[1:7,5:8].sum(axis=1)
df.to_excel(path)
You should be able to do it like this:
df['RGB'] = df['Red'] + df['Green'] + df['Blue']
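If more color columns get involved later, summing over a column list scales the same way; a small sketch, assuming the headers match your file exactly:
cols = ['Red', 'Green', 'Blue']   # assumed column names
df['RGB'] = df[cols].sum(axis=1)  # sum(axis=1) skips NaNs by default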

pandas getting highest frequency value for each group in another column

I have a Pandas dataframe like this:
id color size test
0 0 blue medium 1
1 1 blue small 2
2 5 blue small 4
3 2 blue big 3
4 3 red small 4
5 4 red small 5
My desired output is this:
color size
blue small
red small
I've tried:
df = df[['id', 'color', 'size']]
df = df.groupby(['color'])['size'].value_counts()
and get this:
color size
blue small 2
big 1
medium 1
red small 2
Name: size, dtype: int64
but it turns into a Series with a MultiIndex, and the columns seem all messed up.
Basically, for each of the groups of 'color', I want the 'size' with the highest frequency. I'm really having a lot of trouble with this. Any suggestions? Thanks so much.
We can count with groupby and size, sort_values, then take the tail(1) of each group:
s=df.groupby(['color','size']).size().sort_values().groupby(level=0).tail(1).reset_index()
color size 0
0 blue small 2
1 red small 2
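An alternative sketch, if you only need the single most frequent size per color (ties resolved arbitrarily by value_counts order), uses idxmax inside agg:
df.groupby('color')['size'].agg(lambda s: s.value_counts().idxmax()).reset_index()
  color   size
0  blue  small
1   red  small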

Count the number of NaNs in each group

In this dataframe, I'm trying to count how many NaNs there are for each color within the color column.
This is what the sample data looks like; in reality, there are 100k rows.
color value
0 blue 10
1 blue NaN
2 red NaN
3 red NaN
4 red 8
5 red NaN
6 yellow 2
I'd like the output to look like this:
color count
0 blue 1
1 red 3
2 yellow 0
You can use Series.isna, group by the color column, and sum to add up all the True rows in each group:
df.value.isna().groupby(df.color).sum().reset_index()
color value
0 blue 1.0
1 red 3.0
2 yellow 0.0
You may also use agg() with isnull() or isna(), as follows:
df.groupby('color').agg({'value': lambda x: x.isnull().sum()}).reset_index()
Use isna().sum()
df.groupby('color').value.apply(lambda x: x.isna().sum())
color
blue 1
red 3
yellow 0
Another option uses size and count: size includes NaNs while count excludes them, so their difference is the per-group NaN count.
g=df.groupby('color')['value']
g.size()-g.count()
Out[115]:
color
blue 1
red 3
yellow 0
Name: value, dtype: int64
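To get exactly the color/count frame shown in the question, one way (a sketch) is to cast the boolean sum to int and name the value column during reset_index:
out = df['value'].isna().groupby(df['color']).sum().astype(int).reset_index(name='count')
    color  count
0    blue      1
1     red      3
2  yellow      0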

ffill not retaining all columns

I have a df like this:
Key Class
1 Green
1 NaN
1 NaN
2 Red
2 NaN
2 NaN
and I want to forward fill Class. I'm using this code:
df=df.Class.fillna(method='ffill')
and this returns:
Green
Green
Green
Red
Red
Red
How can I retain the Key column while doing this?
df['Class'] = df.Class.fillna(method='ffill')
In your code you're assigning the result to the whole dataframe; instead, assign it back to just the Class column.
Another way is to call ffill on the whole frame:
In [126]:
df.ffill()
Out[126]:
Key Class
0 1 Green
1 1 Green
2 1 Green
3 2 Red
4 2 Red
5 2 Red
You can also set the inplace parameter to True if you don't want to create a new copy of your df:
df.ffill(inplace=True)
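Note that newer pandas versions (2.1+) deprecate fillna(method='ffill') in favor of the dedicated ffill method, so a forward-compatible version of the single-column fix is:
df['Class'] = df['Class'].ffill()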
