How to access pandas data from a table - python

I am trying to read data using pandas.
Here is what I have tried:
df = pd.read_csv("samples_data.csv")
in_x = df.for_x
in_y = df.for_y
in_init = df.Init
plt.plot(in_x[0], in_y[0], 'b-')
The problem is that in_x and in_y contain strings: (0, '[5 3 9 4.8 2]') (1, '[6 3 9 4.8 2]') ... How can I solve this?
Thank you for taking the time to answer my question.
Indexing into the values gives single characters instead of numbers:
in_x_1 = in_x[2][0] # output: [
in_x_2 = in_x[2][1] # output: 6

Read in the dataframe and slice it with the iloc method:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([
[[5,3,9,4.8,2], [5,3,9,4.8,9], 33],
[[6,3,9,4.8,2], [4,3.8,9,8,4], 87],
[[6.08,2.89,9,4.8,2], [8,3,9,4,7.34], 93],
],
columns=["for_x", "for_y", "Init"]
)
print(df)
in_x = df.for_x.iloc[0]
in_y = df.for_y.iloc[0]
plt.plot(in_x, in_y, 'b-')
plt.show()
Printing the dataframe:
for_x for_y Init
0 [5, 3, 9, 4.8, 2] [5, 3, 9, 4.8, 9] 33
1 [6, 3, 9, 4.8, 2] [4, 3.8, 9, 8, 4] 87
2 [6.08, 2.89, 9, 4.8, 2] [8, 3, 9, 4, 7.34] 93
If your dataframe has string entries, the eval function will turn them into lists which you can then plot data from:
df_2 = pd.DataFrame([
['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
['[6.08,2.89,9,4.8,2]', '[8,3,9,4,7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
in_x = eval(df_2.for_x.iloc[0])
in_y = eval(df_2.for_y.iloc[0])
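As a side note, eval will execute arbitrary Python, so if the strings come from an untrusted file, ast.literal_eval from the standard library is a safer way to parse list literals (a minimal sketch of the same conversion):
import ast
in_x = ast.literal_eval(df_2.for_x.iloc[0])  # parses '[5,3,9,4.8,2]' into a list without executing code
in_y = ast.literal_eval(df_2.for_y.iloc[0])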
If your values are not comma separated:
df_3 = pd.DataFrame([
['[5 3 9 4.8 2]', '[5 3 9 4.8 9]', 33],
['[6 3 9 4.8 2]', '[4 3.8 9 8 4]', 87],
['[6.08 2.89 9 4.8 2]', '[8 3 9 4 7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
string_of_nums_x = df_3.for_x.iloc[0].strip('[').strip(']')
in_x = [float(s) for s in string_of_nums_x.split()]
string_of_nums_y = df_3.for_y.iloc[0].strip('[').strip(']')
in_y = [float(s) for s in string_of_nums_y.split()]
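If you would rather convert the whole column at once instead of one row at a time, a small parser applied with Series.apply works too (a sketch, assuming every entry has the same bracketed, space-separated format):
def parse_row(s):
    # '[5 3 9 4.8 2]' -> [5.0, 3.0, 9.0, 4.8, 2.0]
    return [float(v) for v in s.strip('[]').split()]
df_3['for_x'] = df_3['for_x'].apply(parse_row)
df_3['for_y'] = df_3['for_y'].apply(parse_row)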
Plotting:

Related

Writing a DataFrame to an excel file where items in a list are put into separate cells

Consider a dataframe like pivoted below, where replicates of some data are given as lists:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an excel file as such:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written where replicate data are given in side-by-side cells? Desired excel file:
You need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
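To produce the spreadsheet the question asks for, the reshaped frame can be written out the same way as in the question (a minimal sketch reusing the pivot above; 'output_wide.xlsx' is just a placeholder file name):
wide = (df.explode('Data')
        .assign(n=lambda d: d.groupby(level=0).cumcount())
        .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
        .droplevel('n', axis=1).rename_axis(columns=None))
with pd.ExcelWriter('output_wide.xlsx') as writer:
    wide.to_excel(writer, sheet_name='Sheet1', index_label='Conc')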
Besides @mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')

Make multiple means from one column with pandas

With Python 3.10
Sample data:
import pandas as pd
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
print(df)
df = df.mask(df == 0).fillna(df.mean())
print(df) # <---this works but you will see what I mean about looking off..
Updated Solution:
import numpy as np
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLs with a value (sometimes it fits well, sometimes not so much).
The idea is I have one or more column(s) with bad or missing data.
If I use .fillna(df.mean()) for this it sticks out like a sore thumb.
My goal is to use a percentage of the total number of elements in the dataframe column to compute the new mean from.
I would like to take len(df)*0.30 (30%) of the column and divide it in half.
I would collect half the numbers above the index point where the bad data (null/0) exists,
and half the numbers below that index.
These collected elements would then be used to calculate a replacement value for the missing or bad index point.
This would be more helpful for a data set that is irregular or has missing/bad data.
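Roughly, that idea could be sketched like this (my own rough sketch of the described approach, not an established recipe; the helper name local_mean and the edge handling are assumptions):
import numpy as np
def local_mean(s, i, window_frac=0.30):
    # Use ~30% of the column, half above and half below position i,
    # to estimate a replacement for the bad value at i.
    half = max(int(len(s) * window_frac) // 2, 1)
    window = s.iloc[max(i - half, 0): i + half + 1].drop(s.index[i])
    window = window.replace(0, np.nan)  # treat zeros as bad data too
    return window.mean(skipna=True)
bad = df['data'].isna() | (df['data'] == 0)
df.loc[bad, 'data'] = [local_mean(df['data'], i) for i in np.flatnonzero(bad)]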
You can take a rolling mean with min_periods=1 to smooth out the data, or you can do a variant of this method to customise what you want. Inside the lambda I used np.nanmean(x).
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])
print(df)
result:
column data something ma final
0 1 14890.0 3 14890.000000 14890.0
1 4 5.0 6 7447.500000 5.0
2 7 8.0 9 4967.666667 8.0
3 11 13.0 14 8.666667 13.0
4 12 0.0 18 7.000000 0.0
5 87 NaN 54 6.500000 6.5
6 1 0.0 3 0.000000 0.0
7 4 5.0 6 2.500000 5.0
8 7 8.0 9 4.333333 8.0
9 11 13.0 14 8.666667 13.0
10 12 0.0 18 7.000000 0.0
11 87 NaN 54 6.500000 6.5
12 1 0.0 3 0.000000 0.0
13 4 5.0 6 2.500000 5.0
14 7 8.0 9 4.333333 8.0
15 11 13.0 14 8.666667 13.0
16 12 0.0 18 7.000000 0.0
17 87 10026.0 54 3346.333333 10026.0

Find nearest index in one dataframe to another

I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time I am posting a question here. Sorry if I did something wrong.
So, I have two DataFrames like the ones below, containing X, Y, Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame.
So I wrote the code below, which actually finds the closest point to df2.iloc[0].
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new, updated DataFrame containing the closest point in df1 to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
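In code that just means dropping np.sqrt and taking argmin of the squared distances, which selects the same rows (a minimal sketch):
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
nearest = df1.iloc[sq_dists.argmin(axis=0)]  # same selection as above, without the sqrt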
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np
a = {
'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
'X': [1, 8, 20, 7, 32],
'Y': [6, 4, 17, 45, 32],
'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx,dy,dz: np.sqrt(dx**2+dy**2+dz**2)
def closest(row):
    darr = dist(df1['X']-row['X'], df1['Y']-row['Y'], df1['Z']-row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]
df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
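As a small simplification, darr is a pandas Series, so its idxmin method can replace the np.where lookup (a sketch; equivalent here because df1 has a default RangeIndex):
def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = darr.idxmin()  # index label of the smallest distance
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]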

How to get numpy array from multiple lists of same length and sort along an axis?

I have a very simple question: how do I get a numpy array from multiple lists of the same length and sort it along an axis?
I'm looking for something like:
a = [1,1,2,3,4,5,6]
b = [10,10,11,09,22,20,20]
c = [100,100,111,090,220,200,200]
d = np.asarray(a,b,c)
print d
>>>[[1,10,100],[1,10,100],[2,11,111].........[6,20,200]]
2nd question: if this can be achieved, can I sort it along an axis (e.g. on the values of list b)?
3rd question: can the sorting be done over a range? E.g. for values between b+10 and b-10, while looking at list c for further sorting, like:
[[1,11,111][1,10,122][1,09,126][1,11,154][1,11,191]
[1,20,110][1,25,122][1,21,154][1,21,155][1,21,184]]
You can zip to get the array:
a = [1, 1, 2, 3, 4, 5, 6]
b = [10, 10, 11, 9, 22, 20, 20]
c = [100, 100, 111, 90, 220, 200, 200]
d = np.asarray(list(zip(a, b, c)))  # wrap zip in list() on Python 3, where zip is lazy
print(d)
[[ 1 10 100]
[ 1 10 100]
[ 2 11 111]
[ 3 9 90]
[ 4 22 220]
[ 5 20 200]
[ 6 20 200]]
print(d[np.argsort(d[:, 1])]) # a sorted copy
[[ 3 9 90]
[ 1 10 100]
[ 1 10 100]
[ 2 11 111]
[ 5 20 200]
[ 6 20 200]
[ 4 22 220]]
I don't know how you would do an inplace sort without doing something like:
d = np.asarray(list(zip(a, b, c)))
d.dtype = [("0", int), ("1", int), ("2", int)]
d.shape = d.size
d.sort(order="1")
The leading 0 would make 090 octal in Python 2 and invalid syntax in Python 3, so I removed it.
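Alternatively, starting from the plain 2-D array (before the dtype tricks above), you can reorder it in place by the second column with argsort (a sketch):
d = np.asarray(list(zip(a, b, c)))
order = d[:, 1].argsort(kind='stable')  # stable sort keeps ties in their original order
d[:] = d[order]  # fancy indexing builds the sorted copy, then it is written back in place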
You can also sort the zipped elements before you pass them to numpy:
from operator import itemgetter
zipped = sorted(zip(a,b,c),key=itemgetter(1))
d = np.asarray(zipped)
print(d)
[[ 3 9 90]
[ 1 10 100]
[ 1 10 100]
[ 2 11 111]
[ 5 20 200]
[ 6 20 200]
[ 4 22 220]]
You can use np.dstack and np.lexsort. For example, if you want to sort based on array b (the second column), then a, then c:
>>> d = np.dstack((a, b, c))[0]
>>> indices = np.lexsort((d[:,2], d[:,0], d[:,1]))  # lexsort keys run from least to most significant
>>> d[indices]
array([[ 3, 9, 90],
[ 1, 10, 100],
[ 1, 10, 100],
[ 2, 11, 111],
[ 5, 20, 200],
[ 6, 20, 200],
[ 4, 22, 220]])

Python Numpy arange bins should be shown in ascending order

I have created a series of bins using the Numpy 'arange' function:
bins = np.arange(0, df['eCPM'].max(), 0.1)
The output looks like this:
[1.8, 1.9) 145940.67 52.569295 1.842306
[1.9, 2) 150356.59 54.159954 1.932365
[10.6, 10.7) 150980.84 54.384815 10.626436
[13.3, 13.4) 152038.63 54.765842 13.373157
[2, 2.1) 171494.11 61.773901 2.033192
[2.1, 2.2) 178196.65 64.188223 2.141412
[2.2, 2.3) 186259.13 67.092410 2.264005
How can I get the bins [10.6, 10.7) and [13.3, 13.4) to go where they belong so that all bins appear in ascending order?
I'm assuming the bins are read as strings, hence this issue. I tried adding a dtype: bins = ..., 0.1, dtype=float) but no luck.
[EDIT]
import numpy as np
import pandas
df = pandas.read_csv('path/to/file', skip_footer=1)
bins = np.arange(0, df['eCPM'].max(), 0.1, dtype=float)
df['ecpm group'] = pandas.cut(df['eCPM'], bins, right=False, labels=None)
df = df[['ecpm group', 'Imps', 'Revenue']].groupby('ecpm group').sum()
You could sort the index in "human order" and then reindex:
import numpy as np
import pandas as pd
import re
def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    def atoi(text):
        return int(text) if text.isdigit() else text
    return [atoi(c) for c in re.split(r'(\d+)', text)]
# df = pandas.read_csv('path/to/file', skip_footer=1)
df = pd.DataFrame({'eCPM': np.random.randint(20, size=40)})
bins = np.arange(0, df['eCPM'].max()+1, 0.1, dtype=float)
df['ecpm group'] = pd.cut(df['eCPM'], bins, right=False, labels=None)
df = df.groupby('ecpm group').sum()
df = df.reindex(index=sorted(df.index, key=natural_keys))
print(df)
yields
eCPM
[0, 0.1) 0
[1, 1.1) 5
[2, 2.1) 4
[4, 4.1) 12
[6, 6.1) 24
[7, 7.1) 7
[8, 8.1) 16
[9, 9.1) 45
[10, 10.1) 40
[11, 11.1) 11
[12, 12.1) 12
[13, 13.1) 13
[15, 15.1) 15
[16, 16.1) 64
[17, 17.1) 34
[18, 18.1) 18
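For what it's worth, in more recent pandas versions pd.cut returns ordered Categorical intervals when labels=None, so grouping the original frame may already come back in ascending bin order without any re-indexing (a sketch on the ungrouped df from above, assuming a reasonably recent pandas; very old versions returned string labels, which is what the question ran into):
df = pd.DataFrame({'eCPM': np.random.randint(20, size=40)})
df['ecpm group'] = pd.cut(df['eCPM'], bins, right=False)  # ordered interval categories
print(df.groupby('ecpm group', observed=True).sum())  # rows come out in ascending bin order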
