pandas.DataFrame.mul matching on other column - python

I have a pandas DataFrame (ignore the indices of the DataFrame)
Tab Ind Com Val
4 BAS 1 1 10
5 BAS 1 2 5
6 BAS 2 1 20
8 AIR 1 1 5
9 AIR 1 2 2
11 WTR 1 1 2
12 WTR 2 1 1
And a pandas series
Ind
1 1.208333
2 0.857143
dtype: float64
I want to multiply each element of the Val column of the DataFrame with the element of the series that has the same Ind value. How would I approach this? pandas.DataFrame.mul only matches on index, but I don't want to transform the DataFrame.

Looks like pandas.DataFrame.join could solve your problem (this assumes the series is named Ind, so that after the join the original column becomes Ind_orig and temp.Ind holds the mapped values):
temp = df.join(the_series, on='Ind', lsuffix='_orig')
df['ans'] = temp.Val * temp.Ind
Output
Tab Ind Com Val ans
4 BAS 1 1 10 12.083330
5 BAS 1 2 5 6.041665
6 BAS 2 1 20 17.142860
8 AIR 1 1 5 6.041665
9 AIR 1 2 2 2.416666
11 WTR 1 1 2 2.416666
12 WTR 2 1 1 0.857143
Or another way to achieve the same using a more compact syntax (thanks W-B):
df['New'] = df.Ind.map(the_series).values * df.Val
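Put together as a runnable sketch (sample data reconstructed from the question), the map approach is:

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({"Tab": ["BAS", "BAS", "BAS", "AIR", "AIR", "WTR", "WTR"],
                   "Ind": [1, 1, 2, 1, 1, 1, 2],
                   "Com": [1, 2, 1, 1, 2, 1, 1],
                   "Val": [10, 5, 20, 5, 2, 2, 1]})
the_series = pd.Series({1: 1.208333, 2: 0.857143}, name="Ind")

# map looks each 'Ind' value up in the series' index, so no join is needed
df["ans"] = df["Ind"].map(the_series) * df["Val"]
```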

Related

pandas, expand series of dataframes

I have a series that looks like this:
result
3 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
8 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
11 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
14 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
17 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
20 pd.DataFrame({"ABC":1,"American":2,"Heroes":3})
How do I produce this result:
ABC American Heroes
3 1 2 3
8 1 2 3
11 1 2 3
14 1 2 3
17 1 2 3
20 1 2 3
This is driving me crazy, because if I concat I lose my index.
Here's my closest try: pd.concat(myDf.tolist(), axis=1)
This is a pretty convoluted structure. I tried reconstructing your series of dataframes this way (I don't see any series with this structure in the link you point to):
df_list = [pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]}),
pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]}),
pd.DataFrame({"ABC":[1],"American":[2],"Heroes":[3]})]
series = pd.Series(df_list)
And to get what you want:
df = pd.DataFrame(series.apply(lambda x: x.squeeze().to_list()).to_list(),
                  columns=series[0].columns)
Results:
ABC American Heroes
0 1 2 3
1 1 2 3
2 1 2 3
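If the goal is to also keep the series' original index (3, 8, 11, …), a variant using pd.concat plus reassigning the index works too — a sketch under the same assumption that every element is a one-row DataFrame:

```python
import pandas as pd

# a series of one-row DataFrames, reconstructed as above
df_list = [pd.DataFrame({"ABC": [1], "American": [2], "Heroes": [3]})
           for _ in range(3)]
series = pd.Series(df_list, index=[3, 8, 11])

# stack the one-row frames, then restore the series' own index
out = pd.concat(series.tolist(), ignore_index=True)
out.index = series.index
```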

Python - Adding rows to timeseries dataset

I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe using product from itertools with all the combinations of the columns 'Week' and 'Product', and use merge with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),
                        columns=['Week', 'Product'])
             .merge(dfp, how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now you fillna on both columns with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)  # nothing sold in the missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)
To fill 'Stock', you need to groupby product and forward-fill (ffill), so each missing week carries over the previous week's value. At the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
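An equivalent route, if you prefer index alignment to an explicit merge, is pd.MultiIndex.from_product plus reindex — a sketch on the As-Is data:

```python
import pandas as pd

dfp = pd.DataFrame({"Week": [1, 1, 1, 2, 2],
                    "Product": [1, 2, 3, 1, 3],
                    "Sold": [1, 1, 1, 2, 3],
                    "Stock": [10, 10, 10, 8, 7]})

# the full Week x Product grid; rows absent from dfp come back as NaN
full = pd.MultiIndex.from_product([dfp.Week.unique(), dfp.Product.unique()],
                                  names=["Week", "Product"])
out = dfp.set_index(["Week", "Product"]).reindex(full).reset_index()
out["Sold"] = out["Sold"].fillna(0).astype(int)
out["Stock"] = out.groupby("Product")["Stock"].ffill().astype(int)
```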

Conditional sum from rows into a new column in pandas

I am looking to create a new column in pandas based on the value in each row. My sample data:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
.apply(lambda x: x.shift(1)
.rolling(3, min_periods=1)
.sum())\
.fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
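Note the rolling sum above counts the last 3 rows, not the last 3 calendar weeks, so a gap like week 11 would still pick up weeks 3–5. One way to make it week-aware is to reindex each group onto a continuous week range first — a sketch, where last3 is a hypothetical helper:

```python
import pandas as pd

df = pd.DataFrame({"A": ['a'] * 6 + ['b'] * 4,
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})

def last3(g):
    # fill the missing weeks with 0 sales so the window is truly "last 3 weeks"
    s = g.set_index("Week")["Sales"].reindex(
        range(g.Week.min(), g.Week.max() + 1), fill_value=0)
    rolled = s.shift(1).rolling(3, min_periods=1).sum().fillna(0)
    return pd.Series(rolled.reindex(g.Week).values, index=g.index)

df["Last3WeekSales"] = pd.concat([last3(g) for _, g in df.groupby("A")])
```

With this, week 11 for group 'a' gets 0, matching the output the question asked for.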
You can use a rolling sum over the last 3 values (pandas.rolling_sum in old pandas versions; modern pandas uses .rolling(3).sum()), and shift(n) to shift your column by n rows (1 in your case).
If we suppose you have a column 'Sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())

Rearrange misplaced columns

I want to ask about a data cleansing question for which I suppose Python may be more efficient. The data have a lot of misplaced columns, and I have to use characteristics of certain columns to move values to the right positions. Below is an example in Stata code:
forvalues i = 20(-1)2{
local j = `i' + 25
local k = `j' - 2
replace v`j' = v`k' if substr(v23, 1, 4) == "1980"
}
That is, I move contents in columns v25 - v43 backward by 2 if the observation in column v23 starts with "1980". Otherwise, the columns are correct.
Any help is appreciated.
The following is a simplified example to show it works:
In [65]:
# create some dummy data
import pandas as pd
import io
pd.set_option('display.notebook_repr_html', False)
temp = """v21 v22 v23 v24 v25 v28
1 1 19801923 1 5 8
1 1 20003 1 5 8
1 1 9129389 1 5 8
1 1 1980 1 5 8
1 1 1923 2 5 8
1 1 9128983 1 5 8"""
df = pd.read_csv(io.StringIO(temp),sep='\s+')
df
Out[65]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 5 8
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 5 8
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
In [68]:
# I have to convert my data to a string in order for this to work; it may not
# be necessary for you, in which case the following commented-out line would work:
#df.v23.str.startswith('1980')
df.v23.astype(str).str.startswith('1980')
Out[68]:
0 True
1 False
2 False
3 True
4 False
5 False
Name: v23, dtype: bool
In [70]:
# now we can call shift by 2 along the column axis to assign the values back
df.loc[df.v23.astype(str).str.startswith('1980'),['v25','v28']] = df.shift(2,axis=1)
df
Out[70]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 19801923 1
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 1980 1
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
So what you need to do is define the list of columns up front:
In [72]:
target_cols = ['v' + str(x) for x in range(25,44)]
print(target_cols)
['v25', 'v26', 'v27', 'v28', 'v29', 'v30', 'v31', 'v32', 'v33', 'v34', 'v35', 'v36', 'v37', 'v38', 'v39', 'v40', 'v41', 'v42', 'v43']
Now substitute this back into my method and I believe it should work:
df.loc[df.v23.astype(str).str.startswith('1980'),target_cols] = df.shift(2,axis=1)
See the documentation for shift to understand the parameters.
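Putting the pieces together as one runnable sketch (using v25–v27 as stand-ins for the full v25–v43 range, and float data so the dtypes stay uniform):

```python
import pandas as pd

df = pd.DataFrame({"v23": [19801923.0, 20003.0, 1980.0],
                   "v24": [1.0, 1.0, 1.0],
                   "v25": [5.0, 5.0, 5.0],
                   "v26": [8.0, 8.0, 8.0],
                   "v27": [9.0, 9.0, 9.0]})

target_cols = ["v25", "v26", "v27"]
mask = df.v23.astype(int).astype(str).str.startswith("1980")

# for the flagged rows, each target column takes the value two columns to its left
df.loc[mask, target_cols] = df.shift(2, axis=1)
```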

How to convert pandas dataframe so that index is the unique set of values and data is the count of each value?

I have a dataframe from a multiple choice questions and it is formatted like so:
Sex Qu1 Qu2 Qu3
Name
Bob M 1 2 1
John M 3 3 5
Alex M 4 1 2
Jen F 3 2 4
Mary F 4 3 4
The data is a rating from 1 to 5 for the 3 multiple choice questions. I want to rearrange the data so that the index is range(1, 6), where 1='bad', 2='poor', 3='ok', 4='good', 5='excellent', the columns are the same, and the data is the count of occurrences of each value (excluding the Sex column). This is basically a histogram with fixed bin sizes and the x-axis labeled with strings. I like the output of df.plot() much better than df.hist() for this, but I can't figure out how to rearrange the table to give me that histogram. Also, how do you change the x-labels to be strings?
Series.value_counts gives you the histogram you're looking for:
In [9]: df['Qu1'].value_counts()
Out[9]:
4 2
3 2
1 1
So, apply this function to each of those 3 columns:
In [13]: table = df[['Qu1', 'Qu2', 'Qu3']].apply(lambda x: x.value_counts())
In [14]: table
Out[14]:
Qu1 Qu2 Qu3
1 1 1 1
2 NaN 2 1
3 2 2 NaN
4 2 NaN 2
5 NaN NaN 1
In [15]: table = table.fillna(0)
In [16]: table
Out[16]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
Using table.reindex or table.loc[some_array] you can rearrange the data (the old .ix indexer has been removed from pandas).
To transform to strings, use table.rename:
In [17]: table.rename(index=str)
Out[17]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
In [18]: table.rename(index=str).index[0]
Out[18]: '1'
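To get the 'bad' … 'excellent' labels directly, pass a mapping to rename — a sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["M", "M", "M", "F", "F"],
                   "Qu1": [1, 3, 4, 3, 4],
                   "Qu2": [2, 3, 1, 2, 3],
                   "Qu3": [1, 5, 2, 4, 4]},
                  index=["Bob", "John", "Alex", "Jen", "Mary"])

labels = {1: "bad", 2: "poor", 3: "ok", 4: "good", 5: "excellent"}
table = (df[["Qu1", "Qu2", "Qu3"]]
         .apply(lambda x: x.value_counts())
         .reindex(range(1, 6))   # make sure all five ratings appear, in order
         .fillna(0)
         .astype(int)
         .rename(index=labels))
```

table.plot(kind='bar') then puts the string labels on the x-axis, answering the second part of the question.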
