How to improve dataframe creation from arrays in Pandas - python

I have two arrays A and B which contain a series of numbers.
My goal is to create a dataframe having the following structure:
for each element of B I want to pair all the values of A.
For example:
if A = [0,2,5] and B=[4,9,8] I want to obtain the following pairs: 0-4,0-9,0-8, 2-4,2-9,2-8 and 5-4,5-9,5-8.
I was able to achieve my goal in the following way:
import pandas as pd
import numpy as np
a, b = 1, 10
c, d = -10, -1
step = 0.5
A = np.arange(a,b,1)+step
B = np.arange(c,d,1)
df = pd.DataFrame()
for j in B:
    for i in A:
        name = 'H' + str(int(np.abs(i))) + str(int(np.abs(j)))
        dic = {'XXX': [i], 'YYY': [j], 'ZZZ': name}
        df = pd.concat([df, pd.DataFrame(dic)], ignore_index=True)
Column ZZZ must be calculated as shown above.
The code I wrote works fine but it is pretty slow when I increase the values of a,b,c,d.
Is there a more elegant way to achieve my goal? I would like to avoid nested for loops, and it should obviously be more efficient than my approach.

You can create all combinations with itertools.product.
For column XXX, convert float to int and then to str to remove the decimal part; for column YYY, take the absolute value and cast to str:
from itertools import product
df = pd.DataFrame(list(product(B, A)), columns=['YYY','XXX'])
#swap columns
df = df[['XXX','YYY']]
df['ZZZ'] = 'H' + df.XXX.astype(int).astype(str) + df.YYY.abs().astype(str)
print (df.head(20))
    XXX  YYY   ZZZ
0   1.5  -10  H110
1   2.5  -10  H210
2   3.5  -10  H310
3   4.5  -10  H410
4   5.5  -10  H510
5   6.5  -10  H610
6   7.5  -10  H710
7   8.5  -10  H810
8   9.5  -10  H910
9   1.5   -9   H19
10  2.5   -9   H29
11  3.5   -9   H39
12  4.5   -9   H49
13  5.5   -9   H59
14  6.5   -9   H69
15  7.5   -9   H79
16  8.5   -9   H89
17  9.5   -9   H99
18  1.5   -8   H18
19  2.5   -8   H28
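An alternative sketch uses pandas' own Cartesian-product helper instead of itertools (note that XXX varies slowest here, unlike in the output above):
# build all A-B combinations directly from a MultiIndex
df = pd.MultiIndex.from_product([A, B], names=['XXX','YYY']).to_frame(index=False)
df['ZZZ'] = 'H' + df.XXX.astype(int).astype(str) + df.YYY.abs().astype(str)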

Related

Python: Create some sort of cumsum between two columns

I am trying to figure out how to get some sort of running total using multiple columns, and I can't figure out where to even start. I've used cumsum before, but only on a single column, and that won't work here.
I have this table :
Index A B C
1 10 12 20
2 10 14 20
3 10 6 20
I am trying to build out this table that looks like this:
Index A B C D
1 10 12 20 10
2 10 14 20 18
3 10 6 20 24
The formula for D is as follows:
D2 = ( D1 - B1 ) + C1
D1 = Column A
Any ideas on how I could do this? I am totally out of ideas on this.
This should work:
df.loc[0, 'New_Inventory'] = df.loc[0, 'Inventory']
for i in range(1, len(df)):
    df.loc[i, 'New_Inventory'] = df.loc[i-1, 'Inventory'] - df.loc[i-1, 'Booked'] - abs(df.loc[i-1, 'New_Inventory'])
df.New_Inventory = df.New_Inventory.astype(int)
df
# Index Inventory Booked New_Inventory
#0 1/1/2020 10 12 10
#1 1/2/2020 10 14 -12
#2 1/3/2020 10 6 -16
You can get your answer by using shift; reference the answer here.
import pandas as pd
raw_data = {'Index': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020'],
            'Inventory': [10, 10, 10, 10, 10],
            'Booked': [12, 14, 6, 3, 5]}
df = pd.DataFrame(raw_data)
df['New_Inventory'] = 10 # need to initialize
df['New_Inventory'] = df['Inventory'] - df['Booked'].shift(1) - df['New_Inventory'].shift(1)
df
Your requested output seems wrong; the calculation for New_Inventory above is what was requested.
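Since the recurrence D2 = (D1 - B1) + C1 just accumulates the running differences C - B on top of the starting value, a loop-free sketch (assuming columns named A, B and C as in the question) reproduces the requested 10, 18, 24:
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 10], 'B': [12, 14, 6], 'C': [20, 20, 20]})
# D starts at A's first value; each later row adds the previous row's C - B
df['D'] = df['A'].iloc[0] + (df['C'] - df['B']).cumsum().shift(1, fill_value=0)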

How to create a column using a function based of previous values in the column in python

My Problem
I have a loop that creates a value for x in time period t based on x in time period t-1. The loop is really slow, so I wanted to try to turn it into a function. I tried to use np.where with shift(), but I had no joy. Any idea how I might be able to get around this problem?
Thanks!
My Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('y_list.csv', delimiter = ',')
df = pd.DataFrame(csv1)
df.loc[df.index[0], 'var'] = 0
for x in range(1, len(df.index)):
    if df["LAST"].iloc[x] > 0:
        df["var"].iloc[x] = ((df["var"].iloc[x - 1] * 2) + df["LAST"].iloc[x]) / 3
    else:
        df["var"].iloc[x] = (df["var"].iloc[x - 1] * 2) / 3
df
Input Data
Dates,LAST
03/09/2018,-7
04/09/2018,5
05/09/2018,-4
06/09/2018,5
07/09/2018,-6
10/09/2018,6
11/09/2018,-7
12/09/2018,7
13/09/2018,-9
Output
Dates,LAST,var
03/09/2018,-7,0.000000
04/09/2018,5,1.666667
05/09/2018,-4,1.111111
06/09/2018,5,2.407407
07/09/2018,-6,1.604938
10/09/2018,6,3.069959
11/09/2018,-7,2.046639
12/09/2018,7,3.697759
13/09/2018,-9,2.465173
You are looking for ewm:
arg = df.LAST.clip(lower=0)
arg.iloc[0] = 0
arg.ewm(alpha=1/3, adjust=False).mean()
Output:
0 0.000000
1 1.666667
2 1.111111
3 2.407407
4 1.604938
5 3.069959
6 2.046639
7 3.697759
8 2.465173
Name: LAST, dtype: float64
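Spelled out end to end: the loop's recurrence var = (2 * prev + max(LAST, 0)) / 3 is exactly an exponentially weighted mean with alpha = 1/3, so a minimal sketch (assuming the same y_list.csv layout as the question) is:
import pandas as pd

df = pd.read_csv('y_list.csv')
arg = df['LAST'].clip(lower=0)  # the else-branch contributes 0 for negative LAST
arg.iloc[0] = 0                 # seed the recursion at 0, as the loop did
df['var'] = arg.ewm(alpha=1/3, adjust=False).mean()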
You can use df.shift to shift the dataframe by a default of 1 row, and convert the if-else block into a vectorized np.where:
In [36]: df
Out[36]:
Dates LAST var
0 03/09/2018 -7 0.0
1 04/09/2018 5 1.7
2 05/09/2018 -4 1.1
3 06/09/2018 5 2.4
4 07/09/2018 -6 1.6
5 10/09/2018 6 3.1
6 11/09/2018 -7 2.0
7 12/09/2018 7 3.7
8 13/09/2018 -9 2.5
In [37]: (df.shift(1)['var']*2 + np.where(df['LAST']>0, df['LAST'], 0)) / 3
Out[37]:
0 NaN
1 1.666667
2 1.133333
3 2.400000
4 1.600000
5 3.066667
6 2.066667
7 3.666667
8 2.466667
Name: var, dtype: float64

index counter for if conditions python pandas

I wanted to generate some sort of cycle for my dataframe. One cycle in the example below has a length of 4. The last column shows how it is supposed to look; the rest are attempts on my part.
My current code looks like this:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
i = 0
for i in range(0, length):
    df.loc[i, 'new_cycle'] = i + 1
df['want_cycle']= [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)
I need an if condition in the code so that the value of df['new_cycle'] only increases when the index counter reaches a multiple of 4, for example. But so far I have failed to find a proper way to implement such a condition.
Try this with the default range index. Because your dataframe row index is a range starting at 0 (the default index of a dataframe), you can use floor division to calculate your cycle:
df['cycle'] = df.index//4 + 1
Output:
time A B cycle
0 0.000000 0.0 0 1
1 0.909091 5.0 300 1
2 1.818182 0.6 20 1
3 2.727273 -4.8 -280 1
4 3.636364 -0.3 -25 2
5 4.545455 4.9 290 2
6 5.454545 0.2 30 2
7 6.363636 -4.7 -270 2
8 7.272727 0.5 40 3
9 8.181818 5.0 300 3
10 9.090909 0.1 -10 3
11 10.000000 -4.6 -260 3
Now, if your dataframe index isn't the default, then you can use something like this:
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
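A position-based sketch that avoids the index entirely (assuming numpy is imported as np):
# floor-divide each row's position by the cycle length, then start counting at 1
df['cycle'] = np.arange(len(df)) // 4 + 1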
I've added just one thing for you: a new variable called new_cycle, which will keep the count you're after.
In the for loop we check whether i is divisible by 4 without a remainder; if it is, we add 1 to the new variable and fill the dataframe with this value the same way you did.
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
new_cycle = 0
for i in range(0,length):
    if i % 4 == 0:
        new_cycle += 1
    df.loc[i, 'new_cycle'] = new_cycle
df['want_cycle'] = [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)

Subtracting group-wise mean from a matrix or data frame in python (the "within" transformation for panel data)

In datasets where units are observed multiple times, many statistical methods (particularly in econometrics) apply a transformation to the data in which the group-wise mean of each variable is subtracted off, creating a dataset of unit-level (non-standardized) anomalies from a unit level mean.
I want to do this in Python.
In R, it is handled quite cleanly by the demeanlist function in the lfe package. Here's an example dataset, with a grouping variable fac:
> df <- data.frame(fac = factor(c(rep("a", 5), rep("b", 6), rep("c", 4))),
+ x1 = rnorm(15),
+ x2 = rbinom(15, 10, .5))
> df
fac x1 x2
1 a -0.77738784 6
2 a 0.25487383 4
3 a 0.05457782 4
4 a 0.21403962 7
5 a 0.08518492 4
6 b -0.88929876 4
7 b -0.45661751 5
8 b 1.05712683 3
9 b -0.24521251 5
10 b -0.32859966 7
11 b -0.44601716 3
12 c -0.33795597 4
13 c -1.09185690 7
14 c -0.02502279 6
15 c -1.36800818 5
And the transformation:
> library(lfe)
> demeanlist(df[,c("x1", "x2")], list(df$fac))
x1 x2
1 -0.74364551 1.0
2 0.28861615 -1.0
3 0.08832015 -1.0
4 0.24778195 2.0
5 0.11892725 -1.0
6 -0.67119563 -0.5
7 -0.23851438 0.5
8 1.27522996 -1.5
9 -0.02710938 0.5
10 -0.11049653 2.5
11 -0.22791403 -1.5
12 0.36775499 -1.5
13 -0.38614594 1.5
14 0.68068817 0.5
15 -0.66229722 -0.5
In other words, the following numbers are subtracted from groups a, b, and c:
> library(doBy)
> summaryBy(x1+x2~fac, data = df)
fac x1.mean x2.mean
1 a -0.03374233 5.0
2 b -0.21810313 4.5
3 c -0.70571096 5.5
I'm sure I could figure out a function to do this, but I'll be calling it thousands of times on very large datasets, and would like to know if something fast and optimized has already been built, or is obvious to construct.
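In pandas, groupby with transform is the usual fast path for this; a minimal sketch, assuming a DataFrame with the same fac, x1 and x2 columns as the R example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'fac': ['a']*5 + ['b']*6 + ['c']*4,
                   'x1': np.random.randn(15),
                   'x2': np.random.binomial(10, 0.5, 15)})
# subtract each group's mean from its members (the "within" transformation)
demeaned = df[['x1', 'x2']] - df.groupby('fac')[['x1', 'x2']].transform('mean')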

How to manipulate data from file and sort python

I have a .dat file with this information inside (but the real file has thousands of lines):
n a (au) k0 k1 P1 k2
1 3.156653 2 3 5 -18
2 3.152517 2 5 5 -23
3 3.154422 3 -18 5 29
4 3.151668 3 -16 5 24
5 3.158629 5 -19 5 21
6 3.156970 5 -17 5 16
7 3.155314 5 -15 5 11
8 3.153660 5 -13 5 6
9 3.152007 5 -11 5 1
10 3.150357 5 -9 5 -4
I load the data by:
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
n = array([])
a = array([])
k0 = array([])
k1 = array([])
p1 = array([])
k2 = array([])
l = np.loadtxt('pascal.dat', skiprows=1, usecols=(0,1,2,3,4,5)).T
n = append(n, l[0])
a = append(a, l[1])
k0 = append(k0, l[2])
k1 = append(k1, l[3])  # column order in the header is n, a, k0, k1, P1, k2
p1 = append(p1, l[4])
k2 = append(k2, l[5])
I want to use the values of the column "a(au)" to compute the distance of each element of the "n" column from the a given center, thus:
center = 3.15204
for i in range(len(n)):
    distance = abs(center - a[i])
Well, now I want to re-write the .dat file taking the value of distance into account. Therefore, I want to add a new column called "distance" and then sort all the n rows as a function of this new parameter, with the smallest (closest to the center) first and so on.
Any suggestion?
I suggest using the pandas library - it's a very powerful tool through which you can manipulate data, add columns, etc. Read the .dat file in as a dataframe with read_csv; the file is whitespace-delimited, and because the header "a (au)" contains a space, it is easiest to skip it and supply explicit column names:
import pandas as pd

cols = ['n', 'a', 'k0', 'k1', 'P1', 'k2']
df = pd.read_csv('../pascal.dat', sep=r'\s+', skiprows=1, names=cols)

center = 3.15204
df['distance'] = abs(center - df['a'])
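To finish the sorting step, a minimal sketch that orders the rows by the new column and writes the result back out (the output filename here is just an example):
df = df.sort_values('distance')  # closest to the center first
df.to_csv('pascal_sorted.dat', sep=' ', index=False)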
