How to manipulate data from a file and sort it in Python

I have a .dat file with this information inside (the real file has thousands of lines):
n a (au) k0 k1 P1 k2
1 3.156653 2 3 5 -18
2 3.152517 2 5 5 -23
3 3.154422 3 -18 5 29
4 3.151668 3 -16 5 24
5 3.158629 5 -19 5 21
6 3.156970 5 -17 5 16
7 3.155314 5 -15 5 11
8 3.153660 5 -13 5 6
9 3.152007 5 -11 5 1
10 3.150357 5 -9 5 -4
I load the data by:
import numpy as np
import matplotlib.pyplot as plt

# Load every column; the file header is: n, a (au), k0, k1, P1, k2
l = np.loadtxt('pascal.dat', skiprows=1, usecols=(0, 1, 2, 3, 4, 5)).T
n, a, k0, k1, p1, k2 = l
I want to use the values of the column "a (au)" to compute the distance of each element of the "n" column from a given center, thus:
center = 3.15204
for i in range(len(n)):
    distance = abs(center - a[i])
Now I want to rewrite the .dat file taking the value of distance into account. That is, I want to add a new column called "distance" and then sort all the rows by this new parameter, with the smallest distance (closest to the center) first.
Any suggestions?

I suggest using the pandas library. Read the .dat file in as a dataframe - it's a very powerful tool through which you can manipulate data, add columns, etc.
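import pandas as pd

# The header contains a space ("a (au)"), so skip it and name the columns explicitly
cols = ['n', 'a (au)', 'k0', 'k1', 'P1', 'k2']
df = pd.read_csv('../pascal.dat', sep=r'\s+', skiprows=1, names=cols)

center = 3.15204  # center value from the question
df['distance'] = (center - df['a (au)']).abs()

# Closest to the center first; the output file name is only an example
df = df.sort_values('distance')
df.to_csv('pascal_sorted.dat', sep=' ', index=False)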
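If you would rather stay with numpy, as in the question, the same idea can be sketched with np.argsort. This is only a sketch: a few rows of the sample data are read from an in-memory string instead of pascal.dat, the result is written to an in-memory buffer, and the output header text is an assumption.

```python
import io
import numpy as np

# A few rows of the sample data; an in-memory stand-in for pascal.dat
raw = io.StringIO("""n a (au) k0 k1 P1 k2
1 3.156653 2 3 5 -18
2 3.152517 2 5 5 -23
3 3.154422 3 -18 5 29
9 3.152007 5 -11 5 1
10 3.150357 5 -9 5 -4
""")

data = np.loadtxt(raw, skiprows=1)        # shape (rows, 6)
center = 3.15204
distance = np.abs(data[:, 1] - center)    # column 1 is "a (au)"
order = np.argsort(distance)              # row indices, closest first

# Append the distance column, then reorder the rows
result = np.column_stack([data, distance])[order]

# Write to a buffer here; pass a file name instead to write to disk
out = io.StringIO()
np.savetxt(out, result,
           fmt="%d %.6f %d %d %d %d %.6f",
           header="n a(au) k0 k1 P1 k2 distance", comments="")
print(out.getvalue())
```

Replacing the two io.StringIO objects with real file names gives the on-disk version.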

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations; see "pandas rolling apply using multiple columns" for reference.
I have also included a simpler calculation to demonstrate the behaviour more plainly.
import numpy as np
import pandas as pd
import numpy_ext as npe

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df.loc[0, "A"] = 1

# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1] / 3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r

# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
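sample output
|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |     1      |           0 |
|  1 |   2 |   5 |   -29.6667 |           1 |
|  2 |   2 |   5 |   -59      |           3 |
|  3 |   2 |   5 |   -88.3333 |           5 |
|  4 |   2 |   5 |  -117.667  |           7 |
|  5 |   2 |   5 |  -147      |           9 |
|  6 |   2 |   5 |  -176.333  |          11 |
|  7 |   2 |   5 |  -205.667  |          13 |
|  8 |   2 |   5 |  -235      |          15 |
|  9 |   2 |   5 |  -264.333  |          17 |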
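The loop in the question is a first-order recurrence: each A[i] depends on the freshly computed A[i-1], which is why shift alone does not cover it. One way to keep the loop but make it much cheaper is to run it over plain NumPy arrays and assign the result back once. This is a sketch under that assumption, not a benchmarked solution; the float dtype is chosen here so the division by 3 is not truncated.

```python
import numpy as np
import pandas as pd

# Same setup as the question, with float columns
df = pd.DataFrame({"A": 2.0, "B": 5.0}, index=range(500))
df.loc[0, "A"] = 1.0

# Work on raw arrays: element access is far cheaper than df['A'][i]
a = df["A"].to_numpy(copy=True)
b = df["B"].to_numpy()

for i in range(1, len(a)):
    # Uses the value of a[i-1] computed in the previous iteration
    a[i] = a[i - 1] / 3 - b[i - 1] + 25

df["A"] = a  # single assignment back into the DataFrame
print(df.head())
```

With B fixed at 5 the values settle quickly near 30, the fixed point of x = x/3 + 20, which is a handy sanity check.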

plot line between points pandas

I would like to plot lines between two points and my points are defined in different columns.
#coordinates of the points
#point1(A[0],B[0])
#point2(C[0],D[0])
#line between point1 and point 2
#next line would be
#point3(A[1],B[1])
#point4(C[1],D[1])
#line between point3 and point 4
plot_result:
A B C D E F
0 0 4 7 1 5 1
1 2 5 8 3 3 1
2 3 4 9 5 6 1
3 4 5 4 7 9 4
4 6 5 2 1 2 7
5 1 4 3 0 4 7
I tried this code:
import numpy as np
import matplotlib.pyplot as plt
for i in range(0, len(plot_result.A), 1):
    plt.plot(plot_result.A[i]:plot_result.B[i], plot_result.C[i]:plot_result.D[i], 'ro-')
plt.show()
but it gives a syntax error. I have no idea how to implement this.
The first two parameters of the method plot are x and y which can be single points or array-like objects. If you want to plot a line from the point (x1,y1) to the point (x2,y2) you have to do something like this:
for row in plot_result.values:  # if plot_result is a DataFrame
    x1 = row[0]  # A[i]
    y1 = row[1]  # B[i]
    x2 = row[2]  # C[i]
    y2 = row[3]  # D[i]
    plt.plot([x1, x2], [y1, y2])  # plot one line for every row in the DataFrame
plt.show()
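The row-by-row loop can also be avoided: when plot receives 2D arrays, it draws one line per column. The sketch below rebuilds columns A to D of the question's table by hand (an assumption about how plot_result is stored) and draws all segments in one call.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Columns A-D from the question's table
plot_result = pd.DataFrame({
    "A": [0, 2, 3, 4, 6, 1],
    "B": [4, 5, 4, 5, 5, 4],
    "C": [7, 8, 9, 4, 2, 3],
    "D": [1, 3, 5, 7, 1, 0],
})

# Shape (2, n): row 0 holds the start points, row 1 the end points,
# so every column describes one segment
xs = plot_result[["A", "C"]].to_numpy().T
ys = plot_result[["B", "D"]].to_numpy().T

fig, ax = plt.subplots()
lines = ax.plot(xs, ys, "ro-")  # one Line2D per column
```

Call plt.show() afterwards to display the figure.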

returning result of a function multiple times within the same function

I am trying to return 11 lists of numbers, each of which adds up to 100.
Each list holds 3 numbers generated from a numpy random uniform distribution.
I want to add an if statement that checks whether the 1st, 2nd and 3rd numbers across the 11 lists would, if plotted, each have a Pearson correlation coefficient of >0.99.
At the moment, I am only able to generate 1 list of numbers whose sum is equal to 100.
I have the following code:
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

c1_high = 98
c1_low = 75
c2_high = 15
c2_low = 6
c3_high = 8
c3_low = 2

def mix_gen():
    while True:
        c1 = np.random.uniform(c1_low, c1_high)
        c2 = np.random.uniform(c2_low, c2_high)
        c3 = np.random.uniform(c3_low, c3_high)
        tot = c1+c2+c3
        if 99.99<= tot <=100.01:
            comp_list = [c1,c2,c3]
            return comp_list

my_list = mix_gen()
print(my_list)
So if I were to plot each component across the lists, for example c1, I would get an R^2 value of >0.99.
I'm stuck at generating multiple lists inside the same function. I know this can be done outside the function using [mix_gen() for _ in range(11)], but that will not work because I need the additional check of the Pearson correlation coefficient before returning the 11 lists.
OBJECTIVE:
to return a dataframe with the following values:
C1 C2 C3 sum
1 70 20 10 100
2 ..
3 ..
4 ..
5 ..
6 ..
7 ..
8 ..
9 ..
10 ..
11 90
R^2 1 1 1
One option is to return a list of lists:
def mix_gen(number):
    flag = 0  # number of accepted mixes so far
    container = []
    while flag < number:
        c1 = np.random.uniform(c1_low, c1_high)
        c2 = np.random.uniform(c2_low, c2_high)
        c3 = np.random.uniform(c3_low, c3_high)
        tot = c1+c2+c3
        if 99.99 <= tot <= 100.01:
            flag += 1
            container.append([c1,c2,c3])
    return container
You call it with
my_list_of_lists = mix_gen(11)
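Rejection sampling like this can need many draws before a total lands within 0.01 of 100. A possible alternative, sketched below, draws c2 and c3 and sets c1 to the remainder; with the bounds from the question the remainder always falls between 77 and 92, inside the c1 range. Note the assumption: c1 is then no longer drawn uniformly from its own range, so the distribution differs from the code above. The function name mix_gen_exact and the fixed seed are illustrative only, and the Pearson check is not included.

```python
import numpy as np
import pandas as pd

c1_low, c1_high = 75, 98
c2_low, c2_high = 6, 15
c3_low, c3_high = 2, 8

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

def mix_gen_exact(number):
    """Return `number` mixes whose components sum to 100 by construction."""
    c2 = rng.uniform(c2_low, c2_high, size=number)
    c3 = rng.uniform(c3_low, c3_high, size=number)
    c1 = 100 - c2 - c3  # remainder; between 77 and 92 for these bounds
    df = pd.DataFrame({"C1": c1, "C2": c2, "C3": c3})
    df["sum"] = df[["C1", "C2", "C3"]].sum(axis=1)
    return df

mixes = mix_gen_exact(11)
print(mixes)
```

This returns the DataFrame layout from the question's objective, without the R^2 row.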

How to improve dataframe creation from arrays in Pandas

I have two arrays A and B which contain a series of numbers.
My goal is to create a dataframe having the following structure:
each element of B should be paired with every value of A.
For example:
if A = [0,2,5] and B=[4,9,8] I want to obtain the following pairs: 0-4,0-9,0-8, 2-4,2-9,2-8 and 5-4,5-9,5-8.
I was able to achieve my goal in the following way:
import pandas as pd
import numpy as np
a, b = 1, 10
c, d = -10, -1
step = 0.5
A = np.arange(a,b,1)+step
B = np.arange(c,d,1)
df = pd.DataFrame()
for j in B:
    for i in A:
        name = 'H'+str(int(np.abs(i)))+str(int(np.abs(j)))
        dic = {'XXX':[i],'YYY':[j],'ZZZ':name}
        df = pd.concat([df,pd.DataFrame(dic)],ignore_index=True)
Column ZZZ must be calculated as shown above.
The code I wrote works fine, but it is pretty slow when I increase the values of a, b, c, d.
Is there a more elegant way to achieve my goal? I would like to avoid nested for loops, and the solution should be more efficient than mine.
You can create all combinations with itertools.product.
For column XXX, convert float to int and then to str to remove the decimal part; for column YYY, take the absolute value and cast to str:
from itertools import product
df = pd.DataFrame(list(product(B, A)), columns=['YYY','XXX'])
#swap columns
df = df[['XXX','YYY']]
df['ZZZ'] = 'H' + df.XXX.astype(int).astype(str) + df.YYY.abs().astype(str)
print (df.head(20))
XXX YYY ZZZ
0 1.5 -10 H110
1 2.5 -10 H210
2 3.5 -10 H310
3 4.5 -10 H410
4 5.5 -10 H510
5 6.5 -10 H610
6 7.5 -10 H710
7 8.5 -10 H810
8 9.5 -10 H910
9 1.5 -9 H19
10 2.5 -9 H29
11 3.5 -9 H39
12 4.5 -9 H49
13 5.5 -9 H59
14 6.5 -9 H69
15 7.5 -9 H79
16 8.5 -9 H89
17 9.5 -9 H99
18 1.5 -8 H18
19 2.5 -8 H28
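The same pairs can also be built without leaving numpy, using np.tile and np.repeat. This is a sketch of an alternative to itertools.product, using the A and B arrays from the question; it has not been timed against the version above.

```python
import numpy as np
import pandas as pd

A = np.arange(1, 10, 1) + 0.5   # 1.5 ... 9.5
B = np.arange(-10, -1, 1)       # -10 ... -2

df = pd.DataFrame({
    "XXX": np.tile(A, len(B)),     # A cycles fastest: 1.5, 2.5, ..., 9.5, 1.5, ...
    "YYY": np.repeat(B, len(A)),   # each B value is held for one full pass over A
})

# Same label rule as above: integer part of XXX, absolute value of YYY
df["ZZZ"] = "H" + df["XXX"].astype(int).astype(str) + df["YYY"].abs().astype(str)
print(df.head(20))
```

The row order matches the output shown above, with YYY as the slow-moving column.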

How to draw bar in python

I want to draw a bar chart for the data below:
4 1406575305 4
4 -220936570 2
4 2127249516 2
5 -1047108451 4
5 767099153 2
5 1980251728 2
5 -2015783241 2
6 -402215764 2
7 927697904 2
7 -631487113 2
7 329714360 2
7 1905727440 2
8 1417432814 2
8 1906874956 2
8 -1959144411 2
9 859830686 2
9 -1575740934 2
9 -1492701645 2
9 -539934491 2
9 -756482330 2
10 1273377106 2
10 -540812264 2
10 318171673 2
The 1st column is the x-axis value and the 3rd column is the y-axis value. Several rows can share the same x-axis value. For example,
4 1406575305 4
4 -220936570 2
4 2127249516 2
This means three bars for the x-axis value 4, each labelled with its tag (the value in the middle column). The sample bar chart looks like this:
http://matplotlib.org/examples/pylab_examples/barchart_demo.html
I am using matplotlib.pyplot and numpy. Thanks.
I followed the tutorial you linked to, but it's a bit tricky to shift them by a nonuniform amount:
import numpy as np
import matplotlib.pyplot as plt
x, label, y = np.genfromtxt('tmp.txt', dtype=int, unpack=True)
ux, uidx, uinv = np.unique(x, return_index=True, return_inverse=True)
max_width = np.bincount(x).max()
bar_width = 1/(max_width + 0.5)
locs = x.astype(float)
shifted = []
for i in range(max_width):
    where = np.setdiff1d(uidx + i, shifted)
    locs[where[where<len(locs)]] += i*bar_width
    shifted = np.concatenate([shifted, where])
plt.bar(locs, y, bar_width)
If you want you can label them with the second column instead of x:
plt.xticks(locs + bar_width/2, label, rotation=-90)
I'll leave doing both of them as an exercise to the reader (mainly because I have no idea how you want them to show up).
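The per-bar shift can also be computed without the loop: each row's position inside its x group, multiplied by the bar width, is the offset. The sketch below uses pandas groupby().cumcount() for that position, which is an addition not used in the answer above, and only the first few rows of the sample data.

```python
import numpy as np
import pandas as pd

# First and third columns of the first rows in the question
x = np.array([4, 4, 4, 5, 5, 5, 5, 6])
y = np.array([4, 2, 2, 4, 2, 2, 2, 2])

# Position of each row inside its x group: 0, 1, 2, ...
rank = pd.Series(x).groupby(x).cumcount().to_numpy()

max_width = rank.max() + 1            # largest number of bars sharing one x
bar_width = 1 / (max_width + 0.5)     # same spacing rule as above
locs = x + rank * bar_width           # shifted left edge of every bar
print(locs)
```

The result can be passed to plt.bar(locs, y, bar_width) exactly as above. Unlike the loop, this does not rely on equal x values being adjacent in the file.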
