add/merge items to pandas data frame - python

I would like to use pandas for storing some data in a structure like this:
Quantity  Date  Item  Color
15        0312  car   green
10        0312  car   red
3         0512  car   red
and be able to add an item to the structure:
3 0312 car green
and get the structure updated as a result:
Quantity  Date  Item  Color
18        0312  car   green
10        0312  car   red
3         0512  car   red
Another example add:
-3 0512 car red
Result:
Quantity  Date  Item  Color
18        0312  car   green
10        0312  car   red
If the last three columns have the same values, the Quantity column is updated with the new total.
What is the closest data structure and function in pandas that supports this?

When adding DataFrames with +, pandas first aligns the indexes and then adds the values, so plain + is not what you want here.
Instead you want to use the index labels to select / drop.
First one:
df.loc[3] = ['0312', 'car', 'green'] # or however you have that value stored; note '0312' has to be a string, since an integer literal with a leading zero is invalid Python
Second one:
df = df.drop(3)
I'd encourage you to read the docs on indexing and selecting data

Related

pandas function to check if there exist non-NA values for the same ids?

Assume I have a dataset with around 100,000 rows and 50 columns, containing information about sellers and their products. Part of the dataset looks something like this:
seller_id  product_id  seller_is_checked  size      color
A100       UN76UH      1                  uni size  red
B200       HJHLI90     0                  small     blue
C300       UUKB89      0                  large     green
<...>      <...>       <...>              <...>     <...>
A100       BxYJHG      NA                 medium    purple
AXYZ215    HHIOTY      1                  large     unknown
In the table you can see that seller_id A100 appears at least twice, because this seller has several products. However, a mistake was made while entering the data, and the information about whether the seller is checked went missing.
Is there a function in Python/pandas that will look through the dataset and substitute the missing value with the actual one from the same dataset?
A possible solution, based on filling the missing values downwards and then upwards with a valid observation within each group of seller_id (pandas.DataFrame.ffill and pandas.DataFrame.bfill). Both fills should happen inside the group, otherwise a backward fill could pull in a value from the next seller:
df['seller_is_checked'] = df.groupby('seller_id')['seller_is_checked'].transform(lambda s: s.ffill().bfill())
print(df)
Output:
seller_id product_id seller_is_checked size color
0 A100 UN76UH 1.0 uni size red
1 B200 HJHLI90 0.0 small blue
2 C300 UUKB89 0.0 large green
3 A100 BxYJHG 1.0 medium purple
4 AXYZ215 HHIOTY 1.0 large unknown
You can start by checking for missing values using pandas, for example:
import pandas as pd
# Read the data into a DataFrame, which is essentially a two-dimensional table
df = pd.read_csv("your_csv_file.csv")
# Print how many null values each column contains
print(df.isna().sum())
You can solve this by creating a dictionary from seller_id to seller_is_checked and then filling in the missing values. Follow me (assuming you are using pandas):
1 - remove the rows where the seller_is_checked info is missing and create a new dataset, seller_dict_df, with the results
seller_dict_df = df.dropna(subset=['seller_is_checked'])
2 - create the dictionary
seller_dict = dict(
    zip(
        seller_dict_df['seller_id'], seller_dict_df['seller_is_checked']
    )
)
3 - update the original table, filling only the missing values by looking each seller_id up in the dictionary (replace on the seller_is_checked column would not work here, since that column holds flags, not ids)
df['seller_is_checked'] = df['seller_is_checked'].fillna(df['seller_id'].map(seller_dict))
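Put together, a runnable sketch of the dictionary approach on the sample data from the question (only the relevant columns shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'seller_id': ['A100', 'B200', 'C300', 'A100', 'AXYZ215'],
    'product_id': ['UN76UH', 'HJHLI90', 'UUKB89', 'BxYJHG', 'HHIOTY'],
    'seller_is_checked': [1, 0, 0, np.nan, 1],
})

# Build seller_id -> seller_is_checked from the rows where the flag is present.
known = df.dropna(subset=['seller_is_checked'])
seller_dict = dict(zip(known['seller_id'], known['seller_is_checked']))

# Fill only the missing flags, looking the value up by seller_id.
df['seller_is_checked'] = df['seller_is_checked'].fillna(
    df['seller_id'].map(seller_dict)
)
```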

Using pandas to sum columns based on a criteria

I am trying to use pandas to group sales information based on category and a criteria.
For example in "Table 1" below, I want sales totals for each category excluding those with a "Not Stated" in the Reg/Org column. My ideal output would be in "Table 2" below. My actual data set has 184 columns, and I am trying to capture the sales volume by category across any values excluding those that are "Not Stated".
Thank you for any help or direction that you can provide.
TABLE 1
Category  Reg/Org     Sales
Apple     Regular     10
Apple     Organic     5
Apple     Not Stated  5
Banana    Regular     15
Banana    Organic     5
TABLE 2
Category  Sales
Apple     15
Banana    20
The first part was to summarize the values by column for the entire data set. I utilized the code below to gather that info for each of the 184 columns. Now I want to create a further summary where I create those column totals again, but split by the 89 categories I have. Ideally, I am trying to create a cross tab, where the categories are listed down the rows, and each of the 184 columns contains the sales. (e.g. the column "Reg/Org" would no longer show "Organic" or "Regular", it would just show the sales volume for all values that are not "Not Stated".)
att_list = att.columns.tolist()
ex_list = ['NOT STATED', 'NOT COLLECTED']
sales_list = []
for att_col in att_list:
    sales_list.append(att[~att[att_col].isin(ex_list)]['$'].sum())
Try
df[df["Reg/Org"] != "Not Stated"].groupby("Category")["Sales"].sum()
Or, grouping on both columns first so that "Not Stated" ends up in the index and can be dropped by level:
df.groupby(["Category", "Reg/Org"])["Sales"].sum().drop("Not Stated", level="Reg/Org")
try using YourDataframe.loc[] with a filter inside (this only filters out the "Not Stated" rows; you would still group and sum the result to get the totals):
import pandas as pd
data = pd.read_excel('Test_excel.xlsx')
sales_volume = data.loc[data["Reg/Org"] != "Not Stated"]
print(sales_volume)
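A minimal, runnable version of the filter-then-group approach on the Table 1 data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana'],
    'Reg/Org': ['Regular', 'Organic', 'Not Stated', 'Regular', 'Organic'],
    'Sales': [10, 5, 5, 15, 5],
})

# Keep rows whose Reg/Org value is stated, then total sales per category.
table2 = (df[df['Reg/Org'] != 'Not Stated']
            .groupby('Category', as_index=False)['Sales']
            .sum())
```

This reproduces Table 2: Apple 15, Banana 20.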

How to append value to list in a row based on another column python?

I have a dataframe that looks like:
body label
the sky is blue [noun]
the apple is red. [noun]
Let's take a walk [verb]
I want to add an item to the list in label depending on if there is a color in the body column.
Desired Output:
body label
the sky is blue [noun, color]
the apple is red. [noun, color]
Let's take a walk [verb]
I have tried:
data.loc[data.body.str.contains("red|blue"), 'label'] = data.label.str.append('color')
One option is to use apply on the Series and append to each list directly. This works by mutation: list.append changes each list in place, so the return value of apply is simply discarded:
data.loc[data.body.str.contains('red|blue'), 'label'].apply(lambda lst: lst.append('color'))
data
body label
0 the sky is blue [noun, color]
1 the apple is red. [noun, color]
2 Let's take a walk [verb]
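If relying on in-place mutation feels too implicit, a non-mutating sketch (same column names as above) builds new lists with + and assigns them back:

```python
import pandas as pd

data = pd.DataFrame({
    'body': ['the sky is blue', 'the apple is red.', "Let's take a walk"],
    'label': [['noun'], ['noun'], ['verb']],
})

# Rows whose body mentions a colour get 'color' appended to a *copy*
# of their label list; the originals are left untouched.
mask = data['body'].str.contains('red|blue')
data.loc[mask, 'label'] = data.loc[mask, 'label'].apply(lambda lst: lst + ['color'])
```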

How to take only the first 30 rows in plotly express python

I have an excel file with columns named 'Product' and 'Quantity'. In the Product column there are over 100 different items (clothes, shoes, caps, hats, etc.), while the Quantity column shows how many of each product were sold.
**Product** **Quantity**
Shirt A 2
Shirt A 5
Shirt C 1
Shirt A 9
Shoes B 3
I want to group all the different items and count their total quantity, but only for the 25 most sold products. In pandas it would be like this (note the raw string for the path, and sorting before taking the head, so head(25) keeps the top sellers rather than the first 25 alphabetically):
df = pd.read_csv(r'directory\Sales.csv')
df_products = df[['Product', 'Quantity']].groupby('Product').sum().sort_values(by='Quantity', ascending=False).head(25)
but how can I do this exact same thing in a histogram graph made in plotly.express? I tried this:
fig_product = px.histogram(data_frame=df_products, x='Product', y='Quantity')
This shows me all +100 products name and their quantities sold, but I only want the top 25 of those to show up for me. How can I do that?
It's all in the dataframe preparation:
groupby().sum() to get the totals required
sort_values().head() for the number of items you want to plot; I've picked the top 10 in this example
for this purpose there is no difference between a histogram and a bar chart
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
df = pd.DataFrame({"product": np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 200), "quantity": np.random.uniform(3, 5, 200)})
df = df.groupby("product", as_index=False).sum().sort_values("quantity", ascending=False).head(10)
go.Figure(go.Bar(x=df["product"], y=df["quantity"]))
px.histogram(data_frame=df, x='product', y='quantity')
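An alternative preparation step, sketched here with the question's column names, is nlargest, which sorts and truncates in one call:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Shirt A', 'Shirt A', 'Shirt C', 'Shirt A', 'Shoes B'],
                   'Quantity': [2, 5, 1, 9, 3]})

# Total quantity per product, then keep only the top N rows by Quantity.
top = (df.groupby('Product', as_index=False)['Quantity']
         .sum()
         .nlargest(2, 'Quantity'))
```

The resulting frame can be passed to px.histogram or go.Bar exactly as above.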

Python and pandas, groupby only column in DataFrame

I would like to group some strings in the column called 'tipology' and insert them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y to define them in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
The output gives me tipology as the index, with the counts of distinct values per group:
          number  data
tipology
one       2       113
two       33      33
three     12      88
four      44      888
five      11      66
In the number column (in which I have other values) it gives me the correct grouping of the tipology column. The data column also gives me values (I think it is grouping the dates, but not in the correct format).
I also tried:
tipol = df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column, but no luck: I would need tipology as a regular column (not the index) together with the column of grouping counts, in order to get the x and y axes to import into plotly. One last try I made (making a big mess):
tipol=df.groupby(['tipology'],as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions?
Thanks!
UPDATE
I have too much code to give a full example; it would be difficult for me and would waste your time too. Basically I need a groupby with an additional column showing the group count, e.g.:
tipology Date
home 10/01/18
home 11/01/18
garden 12/01/18
garden 12/01/18
garden 13/01/18
bathroom 13/01/18
bedroom 14/01/18
bedroom 15/01/18
kitchen 16/01/18
kitchen 16/01/18
kitchen 17/01/18
I wish this would happen:
that is, by deleting the Date column and inserting a value column in the DataFrame that holds the count:
tipology value
home 2
garden 3
bathroom 1
bedroom 2
kitchen 3
Then (I'm working in a Jupyter notebook), keeping the Date column and adding the corresponding counts to the value column based on the grouping:
tipology Date value
home 10/01/18 1
home 11/01/18 1
garden 12/01/18 2
garden 12/01/18_____.
garden 13/01/18 1
bathroom 13/01/18 1
bedroom 14/01/18 1
bedroom 15/01/18 1
kitchen 16/01/18 2
kitchen 16/01/18_____.
kitchen 17/01/18 1
I need these as regular columns to assign to the x and y axes of a graph, so none of them should be the index.
By default the method groupby will return a dataframe where the fields you are grouping on will be in the index of the dataframe. You can adjust this behaviour by setting as_index=False in the group by. Then tipology will still be a column in the dataframe that is returned:
tipol1 = df.groupby('tipology', as_index=False).nunique()
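A runnable sketch on the tipology/Date sample from the question. The desired value column is a per-group row count, so size() fits here (nunique would count garden's repeated date only once); with as_index=False both columns stay available for the axes:

```python
import pandas as pd

df = pd.DataFrame({
    'tipology': ['home', 'home', 'garden', 'garden', 'garden',
                 'bathroom', 'bedroom', 'bedroom', 'kitchen', 'kitchen', 'kitchen'],
    'Date': ['10/01/18', '11/01/18', '12/01/18', '12/01/18', '13/01/18',
             '13/01/18', '14/01/18', '15/01/18', '16/01/18', '16/01/18', '17/01/18'],
})

# as_index=False keeps tipology as an ordinary column instead of the index.
tipol1 = df.groupby('tipology', as_index=False).size().rename(columns={'size': 'value'})

# tipol1['tipology'] and tipol1['value'] can now be passed straight to a plot,
# e.g. go.Bar(x=tipol1['tipology'], y=tipol1['value'])
```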
