Python Pandas DataFrame: Inserting Many Arbitrary Values

Let's say I have a data frame that looks like this:
A
0 Apple
1 orange
2 pear
3 apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?

Reindex and fill NaNs:
df.reindex(np.r_[:1000]).fillna('watermelon')
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon' # df.iloc[4:, 0] = "..."
A
0 Apple
1 orange
2 pear
3 apple
4 watermelon
5 watermelon
...
999 watermelon
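
For completeness, a minimal runnable sketch of the reindex approach, assuming the column layout from the question (note that np.r_[:1000] covers indices 0-999; use np.r_[:1001] if index 1000 itself should be included):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'orange', 'pear', 'apple']})

# Reindexing to 0..999 introduces rows of NaN, which fillna then replaces
df = df.reindex(np.r_[:1000]).fillna('watermelon')
print(df.tail(3))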

Related

How to count duplicate rows in pandas dataframe where the order of the column values is not important?

I wonder if we can extend the logic of "How to count duplicate rows in a pandas dataframe?", so that we also count rows that have the same values as other rows, just in a different column order.
Imagine a dataframe like this:
fruit1 fruit2
0 apple banana
1 cherry orange
3 apple banana
4 banana apple
we want to produce an output like this:
fruit1 fruit2 occurrences
0 apple banana 3
1 cherry orange 1
Is this possible?
You can directly re-assign the np.sort values like so, then use value_counts():
import numpy as np
df.loc[:] = np.sort(df, axis=1)
out = df.value_counts().reset_index(name='occurrences')
print(out)
Output:
fruit1 fruit2 occurrences
0 apple banana 3
1 cherry orange 1
You could use np.sort along axis=1 to sort the values in your rows.
Then it's just the regular groupby.size():
import numpy as np
import pandas as pd

fruit_cols = ['fruit1', 'fruit2']
df_sort = pd.DataFrame(np.sort(df.values, axis=1), columns=fruit_cols)
df_sort.groupby(fruit_cols, as_index=False).size()
prints:
fruit1 fruit2 size
0 apple banana 3
1 cherry orange 1
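
Both answers rest on the same idea: sorting within each row makes (apple, banana) and (banana, apple) identical, after which any duplicate-counting method works. A self-contained sketch with the question's data, assuming two plain string columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'fruit1': ['apple', 'cherry', 'apple', 'banana'],
                   'fruit2': ['banana', 'orange', 'banana', 'apple']})

# Sort each row in place so that column order no longer matters, then count
df.loc[:] = np.sort(df, axis=1)
print(df.value_counts().reset_index(name='occurrences'))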

Fill values based on adjacent column

How could I create create_col? For each row, find the previous time that fruit was mentioned and check whether the wanted column was yes.
wanted fruit create_col
0 yes apple
1 pear
2 pear < last time pear was mentioned, wanted was not yes, so blank
3 apple True < last time apple was mentioned, wanted was yes, so True
df
###
wanted fruit
0 yes apple
1 pear
2 yes pear
3 apple
4 mango
5 pear
import numpy as np

# Running concatenation of every fruit seen so far with wanted == 'yes'
df['cum_list'] = df[df['wanted'].eq('yes')]['fruit'].cumsum()
# Shift so each row only sees strictly earlier rows, then forward-fill the gaps
df['cum_list'] = df['cum_list'].shift(1).ffill()
df.fillna('', inplace=True)
# Flag rows whose fruit already appears in the accumulated string
df['create_col'] = np.where(df.apply(lambda x: x['fruit'] in x['cum_list'], axis=1), True, '')
df.drop(columns=['cum_list'], inplace=True)
df
###
wanted fruit create_col
0 yes apple
1 pear
2 yes pear
3 apple True
4 mango
5 pear True
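
One caveat: x['fruit'] in x['cum_list'] is a substring test on the concatenated string, so a fruit like 'apple' would also match a previously wanted 'pineapple'. A hedged alternative with the same "seen with yes before" logic but exact name matching, using a plain Python set (assumes the df shown above):
seen = set()   # fruits already mentioned with wanted == 'yes'
flags = []
for wanted, fruit in zip(df['wanted'], df['fruit']):
    flags.append(True if fruit in seen else '')
    if wanted == 'yes':
        seen.add(fruit)
df['create_col'] = flags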

filter df using key words

I have a df:
item_name price stock
red apple 2 2
green apple 4 1
green grape 4 3
yellow apple 1 2
purple grape 4 1
I have another df:
Key Word Min_stock
red;grape 2
The result I would like to get is:
item_name price stock
red apple 2 2
green grape 4 3
I would like to filter the first df based on the second df: select the rows whose item_name contains either keyword from the Key Word column.
Is there any way to achieve this?
Assuming df1 and df2 are the DataFrames, you can compute a regex from the split and exploded df2, then extract and map the min values and filter with boolean indexing:
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock'])
# red 2
# grape 2
# blue 10
# apple 10
regex = '|'.join(s.index)
# 'red|grape|blue|apple'
mask = df1['item_name'].str.extract(f'({regex})', expand=False).map(s)
# 0 2
# 1 10
# 2 2
# 3 10
# 4 2
out = df1[mask.notna() & df1['stock'].ge(mask)]
output:
item_name price stock
0 red apple 2 2
2 green grape 4 3
NB: for generality, I used a different df2 as input:
Key Word Min_stock
red;grape 2
blue;apple 10
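
Putting it together with the sample frames (using the two-row df2 from the note above), a self-contained sketch:
import pandas as pd

df1 = pd.DataFrame({'item_name': ['red apple', 'green apple', 'green grape',
                                  'yellow apple', 'purple grape'],
                    'price': [2, 4, 4, 1, 4],
                    'stock': [2, 1, 3, 2, 1]})
df2 = pd.DataFrame({'Key Word': ['red;grape', 'blue;apple'],
                    'Min_stock': [2, 10]})

# keyword -> minimum stock threshold
s = (df2.assign(kw=df2['Key Word'].str.split(';'))
        .explode('kw')
        .set_index('kw')['Min_stock'])

# For each item, extract the first matching keyword and look up its threshold;
# items matching no keyword map to NaN and are dropped by the mask
mask = df1['item_name'].str.extract(f"({'|'.join(s.index)})", expand=False).map(s)
print(df1[mask.notna() & df1['stock'].ge(mask)])
Note that if an item_name contains several keywords, str.extract keeps the leftmost match in the string, so that keyword's threshold wins.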

python: merge two databases based on matching multiple columns in both datasets and apply a script to the result

I have two datasets with multiple columns: dataset-1 (df1) has more than a couple of thousand rows, and dataset-2 (df2) is smaller, around 300 rows.
I need to pick up a 'value' from column 3 in df2 based on matching 'fruit' in df1 with 'type' in df2, and 'expiry' in df1 with 'expiry' in df2.
Furthermore, instead of storing the 'value' directly in a new column in df1, I need to multiply the value in each row by its expiry and store the output in a new column in df1.
So, for example, if expiry is 2 the value gets multiplied by 2, and if it's 3 the value gets multiplied by 3, and so on.
I was able to solve this with the code below, but:
for i in range(0, len(df1)):
    df1_value = df2.loc[(df2['type'] == df1.iloc[i]['fruit']) &
                        (df2['expiry'] == str(df1.iloc[i]['expiry']))].iloc[0]['value']
    df1.loc[i, 'df_value'] = df1.iloc[i]['expiry'] * df1_value
This creates two issues:
If the iteration hits a missing combination (for example, there is no 'value' for banana with an expiry of 3 in df2), the process stops with IndexError: single positional indexer is out-of-bounds.
Because df1 has a very large number of rows, the individual iterations take a lot of time.
Is there a better way to handle this?
say df1:
fruit expiry category
apple 3 a
apple 3 b
apple 4 c
apple 4 d
orange 2 a
orange 2 b
orange 3 c
orange 3 d
orange 3 e
banana 3 a
banana 3 b
banana 3 c
banana 4 d
pineapple 2 a
pineapple 3 b
pineapple 3 c
pineapple 4 d
pineapple 4 e
df2:
type expiry value
apple 2 100
apple 3 110
apple 4 120
orange 2 200
orange 3 210
orange 4 220
banana 2 310
banana 4 320
pineapple 2 410
pineapple 3 420
pineapple 4 430
output (revised df1):
fruit expiry category df_value
apple 3 a 110*3=330
apple 3 b 110*3=330
apple 4 c 120*4=480
apple 4 d 120*4=480
orange 2 a 200...
orange 2 b 200...
orange 3 c 210...
orange 3 d 210...
orange 3 e 210...
banana 3 a 0
banana 3 b 0
banana 3 c 0
banana 4 d 320*4=1280
pineapple 2 a 410*2=820
pineapple 3 b 420...
pineapple 3 c 420...
pineapple 4 d 430....
pineapple 4 e 430....
As far as I know you can only do this by using SQL within Python. SQL is used for relating different databases that share at least one relatable column (if you've used Power BI or Tableau you know what I mean) and for querying multiple dataframes through their mutual relationships. I do not know this language, so I cannot help you further than this.
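That said, pandas can express this relational join directly with DataFrame.merge, no SQL required. A hedged sketch, assuming df1 and df2 as shown above and that both expiry columns share the same dtype (the original loop compares against str(...), so an explicit astype may be needed first); the left join leaves unmatched combinations such as banana with expiry 3 as NaN, which fillna(0) turns into the 0 the question asks for:
import pandas as pd

# One vectorised left join instead of a row-by-row loop; df1's row order
# and length are preserved because df2's (type, expiry) keys are unique
merged = df1.merge(df2, how='left',
                   left_on=['fruit', 'expiry'], right_on=['type', 'expiry'])

# Missing matches (e.g. banana / expiry 3) yield NaN -> treat the value as 0
vals = merged['value'].fillna(0).to_numpy()
df1['df_value'] = (vals * df1['expiry'].to_numpy()).astype(int)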

combine columns containing empty strings into one column in python pandas

I have a dataframe like below.
import pandas as pd

df = pd.DataFrame({'apple': [1, 0, 1, 0],
                   'red grape': [1, 0, 0, 1],
                   'banana': [0, 1, 0, 1]})
I need to create another column that combines these columns, separated by ';', like below:
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
What I did was convert 1/0 to the column name/an empty string, then concatenate the columns:
df['apple'] = df.apple.apply(lambda x: 'apple' if x==1 else '')
df['red grape'] = df['red grape'].apply(lambda x: 'red grape' if x==1 else '')
df['banana'] = df['banana'].apply(lambda x: 'banana' if x==1 else '')
df['fruits'] = df['apple']+';'+df['red grape']+';'+df['banana']
   apple  red grape  banana  fruits
0  apple  red grape          apple;red grape;
1                    banana  ;;banana
2  apple                     apple;;
3         red grape  banana  ;red grape;banana
The separators are all screwed up because of the empty strings. Also, I want the solution to be more general: I might have lots of such columns to combine, and I do not want to hardcode everything.
Does anyone know the best way to do this? Thanks a lot.
Use DataFrame.insert to add the result as the first column, DataFrame.dot for matrix multiplication with the separator appended to each column name, and finally Series.str.rstrip to remove the separator from the right side:
df.insert(0, 'fruits', df.dot(df.columns + ';').str.rstrip(';'))
print (df)
fruits apple red grape banana
0 apple;red grape 1 1 0
1 banana 0 0 1
2 apple 1 0 0
3 red grape;banana 0 1 1
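
Why this works: the frame holds only 0/1 indicators, so the matrix product df.dot(df.columns + ';') multiplies each 'name;' string by 0 or 1 and concatenates the survivors per row; only a trailing ';' can remain, which rstrip removes. The same pattern generalizes when only some columns are indicators (cols below is a hypothetical list naming them; this assumes the original frame, before the insert above):
cols = ['apple', 'red grape', 'banana']  # hypothetical: just the 0/1 columns
df.insert(0, 'fruits', df[cols].dot(df[cols].columns + ';').str.rstrip(';'))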
