Using Pandas to calculate the difference between column values - python

I have a csv with two columns, Dates and Profits/Losses that I have read into the data frame.
import os
import csv
import pandas as pd
cpath = os.path.join('..', 'Resources', 'budget_data.csv')
df = pd.read_csv(cpath)
df["Profit/Losses"]= df["Profit/Losses"].astype(int)
data = pd.DataFrame(
[
["2019-01-01", 40],
["2019-02-01", -5],
["2019-03-01", 15],
],
columns = ["Dates", "Profit/Losses"]
)
I want to know the month-to-month differences in profits and losses (each row is one month), so I thought to use df.diff to calculate the values:
df.diff()
However, this results in errors. I think it is trying to difference the Dates column as well, and I'm not sure how to make it calculate only the profits and losses.

Is this what you are looking for?
import pandas as pd
data = pd.DataFrame(
[
["2019-01-01", 40],
["2019-02-01", -5],
["2019-03-01", 15],
],
columns = ["Dates", "Profit/Losses"]
)
data.assign(Delta=lambda d: d["Profit/Losses"].diff().fillna(0))
Yields
Dates Profit/Losses Delta
0 2019-01-01 40 0.0
1 2019-02-01 -5 -45.0
2 2019-03-01 15 20.0
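If the first month should stay empty rather than be filled with 0, calling diff on just the numeric column avoids the Dates problem. This is only a sketch using the sample data above; the names changes, average_change and numeric_diff are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [["2019-01-01", 40], ["2019-02-01", -5], ["2019-03-01", 15]],
    columns=["Dates", "Profit/Losses"],
)

# Parse the dates so they are real timestamps instead of strings
data["Dates"] = pd.to_datetime(data["Dates"])

# diff() on a single column only touches that column;
# the first row has no previous month, so it stays NaN
changes = data["Profit/Losses"].diff()

# mean() skips NaN by default, so the first row does not distort the average
average_change = changes.mean()

# Alternative: difference every numeric column and leave the others out
numeric_diff = data.select_dtypes("number").diff()

print(changes)
print(average_change)
```

Leaving the first value as NaN matters if you go on to average the changes, since a filled-in 0 would be counted as a real month-to-month change.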

Maybe you can do this:
import pandas as pd
x = [[1,2], [1,2], [1,4]]
d = pd.DataFrame(x, columns=['loss', 'profit'])
d.insert(0, "diff", d['profit'] - d['loss'])
d.head()
Gives:
diff loss profit
0 1 1 2
1 1 1 2
2 3 1 4

Related

python: divide a dataframe into the same intervals as another dataframe

I divided the following dataframe into 4 intervals according to the 'ages' column.
Let's say that I want another dataframe to have the same exact intervals, is there a quick way to do so?
In other words, the following lines
df1['age_groups'] = pd.cut(df1.ages,4)
print(df1['age_groups'])
divides the dataframe into the following intervals
(1.944, 16.0] 5
(16.0, 30.0] 3
(30.0, 44.0] 2
(44.0, 58.0] 2
but if I have a different dataframe with slightly different numbers in a column with the same name, the same code will produce different intervals.
How do I make sure I can subdivide other dataframes into the same intervals?
import pandas as pd
ages=[35.000000,
2.000000,
27.000000,
14.000000,
4.000000,
58.000000,
20.000000,
39.000000,
14.000000,
55.000000,
2.000000,
29.699118]
values=[1,0,1,1,0,0,0,1,0,0,1,1]
df1=pd.DataFrame()
df1['ages']=ages
df1['values']=values
#print(df1)
df1['age_groups'] = pd.cut(df1.ages,4)
Save the bins from the first DataFrame using the retbins keyword, then pass them as the bins argument for the second DataFrame:
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
Working example:
import numpy as np
import pandas as pd
np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
>>> df1.head()
ages age_groups
0 18 (11.935, 28.25]
1 34 (28.25, 44.5]
2 77 (60.75, 77.0]
3 58 (44.5, 60.75]
4 20 (11.935, 28.25]
>>> df2.head()
ages age_groups
0 11 NaN
1 23 (11.935, 28.25]
2 14 (11.935, 28.25]
3 69 (60.75, 77.0]
4 77 (60.75, 77.0]
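Note the NaN in the first row of df2: a value that falls outside the saved bin edges gets no interval. One possible way around this, sketched below, is to widen the outermost edges to infinity before reusing the bins. The df2 values and the names strict and open_bins are invented for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ages": [35, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, 29.699118]})
df2 = pd.DataFrame({"ages": [1, 10, 30, 45, 60]})  # made-up second dataset

# Keep the bin edges computed for df1
df1["age_groups"], bins = pd.cut(df1["ages"], 4, retbins=True)

# Reusing the edges as-is: 1 and 60 fall outside the range and become NaN
strict = pd.cut(df2["ages"], bins=bins)

# Widen the outer edges so every value lands in some interval
open_bins = bins.copy()
open_bins[0], open_bins[-1] = -np.inf, np.inf
df2["age_groups"] = pd.cut(df2["ages"], bins=open_bins)

print(strict)
print(df2)
```

The interior edges stay identical to those of df1, so only the first and last groups change meaning (they become open-ended).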

How do I convert 2 column array(randomly generated) to a DataFrame?

Using a numpy random number generator, generate arrays on height and weight of the 88,000 people living in Utah.
The average height is 1.75 metres and the average weight is 70 kg. Assume a standard deviation of 3.
Combine these two arrays using the column_stack method and convert the result into a pandas DataFrame with the first column named 'height' and the second column named 'weight'.
I've generated the random data. However, I can't seem to convert the array to a DataFrame.
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
print(Utah)
df = pd.DataFrame(
[[np_height],
[np_weight]],
index = [0, 1],
columns = ['height', 'weight'])
print(df)
You want 2 columns, yet you passed data [[np_height],[np_weight]] as 1 column. You can set the data as dict.
df = pd.DataFrame({'height':np_height,
'weight':np_weight},
columns = ['height', 'weight'])
print(df)
The data in Utah is already in a suitable shape. Why not use that?
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
df = pd.DataFrame(
data=Utah,
columns=['height', 'weight']
)
print(df.head())
height weight
0 3.57 65.32
1 -0.15 66.22
2 5.65 73.11
3 2.00 69.59
4 2.67 64.95
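Both answers should give the same frame. Here is a small sketch that checks this, assuming NumPy's newer Generator API (np.random.default_rng) with an arbitrary seed of 42 so the run is repeatable. The seed and the names df_from_stack and df_from_dict are my own choices, not part of the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability

# 88,000 samples each, standard deviation 3 as the exercise states
height = np.round(rng.normal(1.75, 3, 88000), 2)
weight = np.round(rng.normal(70, 3, 88000), 2)

# Route 1: stack into an (88000, 2) array, then name the columns
utah = np.column_stack((height, weight))
df_from_stack = pd.DataFrame(utah, columns=["height", "weight"])

# Route 2: pass each 1-D array as its own column via a dict
df_from_dict = pd.DataFrame({"height": height, "weight": weight})

print(df_from_stack.shape)
print(df_from_stack.equals(df_from_dict))
```

With a standard deviation of 3 metres some heights come out negative, as in the output above; that follows from the exercise's assumption rather than from the conversion.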

Array in DataFrame Panda Python

I have this DataFrame. The ArraysDate column contains many elements per row. I want to be able to number the elements and loop over them, the way I would with a for loop over an array in Java. I have not found a solution; could you suggest some ideas?
For example, with CustomerNumber = 4, ArraysDate has 3 elements, which I want to refer to as i1, i2, i3, i4 in calculations on ArraysDate.
Thank you.
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
If I understand correctly, you want to get an array of data from 'ArraysDate' based on column 'CustomerNumber'.
Basically, you can use loc
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
You can use the lambda to sum all items in the array per row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df
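If the goal is to number the elements and loop over them rather than just sum them, something like the following might work. It is only a sketch: the labels i1, i2, ... follow the question, while numbered, long_df and the position column are names I made up.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerNumber": [1, 2, 3, 4],
    "ArraysDate": [[1, 13], [3], [0], [2, 60, 30, 40]],
})

# Pull out the list for one customer and number its elements from 1
dates = df.loc[df["CustomerNumber"] == 4, "ArraysDate"].iloc[0]
numbered = {f"i{pos}": value for pos, value in enumerate(dates, start=1)}
for label, value in numbered.items():
    print(label, value)

# Or reshape to one row per element, with a running position per customer
long_df = df.explode("ArraysDate").reset_index(drop=True)
long_df["position"] = long_df.groupby("CustomerNumber").cumcount() + 1
print(long_df)
```

The long format is usually easier to work with in pandas, because ordinary groupby and filtering operations then apply to the individual elements.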

Python Linear Regression input values

I have an Excel sheet with 2 columns and 1000 rows.
I want to give this as input to my Linear Regression fit command using sklearn.
When I create a dataframe using pandas, how can I give the inputs?
Something like df_x=pd.DataFrame(...)
Without a dataframe, I did it successfully as:
npMatrix=np.matrix(raw_data)
X,Y=npMatrix[:,1],npMatrix[:,2]
md1=LinearRegression().fit(X,Y)
Can you help me with how to access the columns in pandas?
I think you can convert a pandas dataframe to a numpy array by np.array()
This is discussed here: Quora: How does python-pandas go along with scikit-learn library?
The example, by Muktabh Mayank, is copied below:
>>> from pandas import *
>>> from numpy import *
>>> new_df = DataFrame(array([[1,2,3,4],[5,6,7,8],[9,8,10,11],[16,45,67,88]]))
>>> new_df.index= ["A1","A2","A3","A4"]
>>> new_df.columns= ["X1","X2","X3","X4"]
>>> new_df
X1 X2 X3 X4
A1 1 2 3 4
A2 5 6 7 8
A3 9 8 10 11
A4 16 45 67 88
>>> array(new_df)
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 8, 10, 11],
[16, 45, 67, 88]], dtype=int64)
>>>
And btw, people are actually working on bridging sklearn and pandas: sklearn-pandas
You can read the Excel file:
df = pd.read_excel(...)
You can select a single column by its label. If the columns are labelled with integers (for example, when the file is read with header=None):
X = df[0]
Y = df[1]
If the columns have names, e.g. "column1", "column2":
X = df["column1"]
Y = df["column2"]
But this returns a single column as a Series.
If you need a single column as a DataFrame, use a list of columns:
X = df[ [0] ]
Y = df[ [1] ]
More: How to get column by number in Pandas?
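The Series versus DataFrame distinction matters here because scikit-learn estimators expect X to be two-dimensional (samples by features). Below is a rough sketch of selecting by position with iloc, which works whatever the column labels are. The data values are invented, and the fit call itself is left out so the snippet only needs pandas.

```python
import pandas as pd

# Invented stand-in for the two-column Excel sheet
df = pd.DataFrame({
    "column1": [1.0, 2.0, 3.0, 4.0],
    "column2": [2.1, 3.9, 6.2, 7.8],
})

# A list of positions keeps the result 2-D: a DataFrame of shape (4, 1)
X = df.iloc[:, [0]]

# A single position gives a 1-D Series of shape (4,)
y = df.iloc[:, 1]

# Plain NumPy arrays, if those are preferred over pandas objects
X_array = X.to_numpy()
y_array = y.to_numpy()

print(X.shape, y.shape)
print(X_array.shape, y_array.shape)
```

These X and y could then be passed to LinearRegression().fit(X, y) as in the question.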

How to plot stacked & normalized histograms?

I have a dataset that maps continuous values to discrete categories. I want to display a histogram with the continuous values as x and categories as y, where bars are stacked and normalized. Example:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
},
columns=['score', 'category'])
print(df.head(10))
Output:
score category
0 0.649371 B
1 0.042309 B
2 0.689487 A
3 0.433064 B
4 0.978859 A
5 0.789140 C
6 0.215758 D
7 0.922389 B
8 0.105364 D
9 0.010274 C
If I try to plot this as a histogram using df.hist(by='category'), I get 4 graphs:
I managed to get the graph I wanted but I had to do a lot of manipulation.
# One column per category, 1 if maps to category, 0 otherwise
df2 = pd.DataFrame({
'score' : df.score,
'A' : (df.category == 'A').astype(float),
'B' : (df.category == 'B').astype(float),
'C' : (df.category == 'C').astype(float),
'D' : (df.category == 'D').astype(float)
},
columns=['score', 'A', 'B', 'C', 'D'])
# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])
# Sum over series for weights
df4 = df3.sum(1)
bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))
bars.plot.bar(stacked=True)
I expect there is a more straightforward way to do this, one that is easier to read and understand and needs fewer intermediate steps. Any solutions?
I don't know if this is really that much more compact or readable than what you already have, but it is a suggestion (a late one :)).
import numpy as np
import pandas as pd
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)
# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size()
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0,level=0)
# Plot
df_a.unstack().plot.bar(stacked=True)
Consider assigning bins with cut, calculating group percentages with a couple of groupby().transform calls, and then aggregating and reshaping with pivot_table:
# CREATE BIN INDICATORS
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0,1.1,0.1),
labels=np.arange(0,1,0.1)).round(1)
# CALCULATE PCT OF CATEGORY OUT OF BINs
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
.div(df.groupby(['plot_bins'])['score'].transform('count')))
# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
.reset_index(drop=True))
# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
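Another option that may be shorter is pd.crosstab with normalize="index", which counts and normalizes in one call. This is a sketch on freshly generated data, not on the frames above; the seed and the names score_bins and shares are my own, and the plot call is commented out so the snippet runs without matplotlib.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
df = pd.DataFrame({
    "score": rng.random(1000),
    "category": rng.choice(list("ABCD"), 1000),
})

# Ten bins of width 0.1; include_lowest keeps a score of exactly 0.0
score_bins = pd.cut(df["score"], np.linspace(0, 1, 11), include_lowest=True)

# Rows are score bins, columns are categories,
# and each row is scaled so that it sums to 1
shares = pd.crosstab(score_bins, df["category"], normalize="index")
print(shares)

# With matplotlib installed, the stacked, normalized bars would be:
# shares.plot.bar(stacked=True)
```

Each row of shares holds the fraction of each category within that score bin, so the stacked bars all reach a height of 1.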
