Create nested list from Pandas dataframe - python

I have a simple pandas DataFrame with two columns. I would like to generate a nested list of those two columns.
geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
                    'lon': [-73.935385, -73.988911, -73.957649]})
My solution to this problem is the following:
X = [[i] for i in geo['lat'].tolist()]
Y = [i for i in geo['lon'].tolist()]
for key, value in enumerate(X):
    X[key].append(Y[key])
However, I feel there must be a better way than this.
Thanks!

pandas is built on top of numpy. A DataFrame stores its values in a numpy array, which has a tolist method.
>>> geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
...                     'lon': [-73.935385, -73.988911, -73.957649]})
>>> geo.values
array([[ 40.672304, -73.935385],
       [ 40.777169, -73.988911],
       [ 40.712196, -73.957649]])
>>> geo.values.tolist()
[[40.672304, -73.935385], [40.777169, -73.988911], [40.712196, -73.957649]]
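As a side note, on pandas 0.24+ the documented accessor for the underlying array is DataFrame.to_numpy(), so an equivalent, forward-compatible spelling of the same idea is:
import pandas as pd

geo = pd.DataFrame({'lat': [40.672304, 40.777169, 40.712196],
                    'lon': [-73.935385, -73.988911, -73.957649]})

# to_numpy() returns the same ndarray that .values exposes here
nested = geo.to_numpy().tolist()
print(nested)
# [[40.672304, -73.935385], [40.777169, -73.988911], [40.712196, -73.957649]]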

How about
out_list = []
for index, row in geo.iterrows():
    out_list.append([row.lat, row.lon])
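For what it's worth, a plain zip over the two columns gives the same nested list without the per-row overhead of iterrows; a minimal sketch using the geo frame from the question:
# pair up lat/lon values positionally and turn each pair into a list
nested = [list(pair) for pair in zip(geo['lat'], geo['lon'])]
# [[40.672304, -73.935385], [40.777169, -73.988911], [40.712196, -73.957649]]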

Related

efficient way of computing a dataframe using concat and split

I am new to python/pandas/numpy and I need to create the following DataFrame:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split(r'\#|/', r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
 ['highway=primary',
  'maxspeed=30',
  'oneway=yes',
  'ref=N 22',
  'surface=asphalt'],
 ['3655224911#1.735928/42.543651',
  '3655224917#1.735766/42.543561',
  '3655224916#1.735694/42.543523',
  '3655224915#1.735597/42.543474',
  '4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7) elements), and the inner list in h.DF[*][2] is also very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split(r'\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a way to obtain the same result without running out of memory?
I managed to make it work using the following code:
bl = []
for x in h.DF:
    # split each 'id#lon/lat' string on '#', keep the 'lon/lat' part,
    # then split that on '/' and parse the pieces as floats
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    # append the way id to every coordinate pair
    [i.append(x[0]) for i in data]
    bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)
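To make the nested-loadtxt trick concrete, here is a self-contained sketch on a single record shaped like the question's h.DF elements (the record is abridged from the question; the variable names are mine):
import itertools
import numpy as np
import pandas as pd

record = ['5203906',
          ['highway=primary', 'maxspeed=30'],
          ['3655224911#1.735928/42.543651',
           '3655224917#1.735766/42.543561']]

bl = []
for x in [record]:
    # np.loadtxt accepts any iterable of strings, so it can parse the list directly:
    # split on '#', keep the 'lon/lat' half, then split on '/' as floats
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    [i.append(x[0]) for i in data]
    bl.append(data)

DF = pd.DataFrame(list(itertools.chain.from_iterable(bl))).rename(
    columns={0: "lon", 1: "lat", 2: "wayid"})
print(DF)
#         lon        lat    wayid
# 0  1.735928  42.543651  5203906
# 1  1.735766  42.543561  5203906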

pandas column calculated using function including dict lookup, 'Series' objects are mutable, thus they cannot be hashed

I am aware there are tons of questions similar to mine, but I could not find the solution to my question in the last 30 minutes of looking through dozens of threads.
I have a dataframe with hundreds of columns and rows, and use most columns within a function to return a value that's supposed to be stored in an additional column.
The problem can be broken down to the following:
lookup = {"foo": 1, "bar": 0}

def lookuptable(input_string, input_factor):
    return lookup[input_string] * input_factor

mydata = pd.DataFrame([["foo", 4], ["bar", 3]], columns=["string", "faktor"])
mydata["looked up value"] = lookuptable(mydata["string"], mydata["faktor"])
But this returns:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a way to avoid this problem without restructuring the function itself?
Thanks in advance!
Try this:
lookup = {"foo": 1, "bar": 0}

def lookuptable(data):
    return lookup[data["string"]] * data["faktor"]

mydata = pd.DataFrame([["foo", 4], ["bar", 3]], columns=["string", "faktor"])
mydata["looked up value"] = mydata.apply(lookuptable, axis=1)
print(mydata)

  string  faktor  looked up value
0    foo       4                4
1    bar       3                0
Besides .apply(), you can also use a list comprehension with .iterrows():
mydata["looked up value"] = [lookuptable(row["string"], row["faktor"]) for _, row in mydata.iterrows()]
Your function accepts two parameters, a string and an integer, but you are passing it two pandas Series instead. You can, however, iterate over the dataframe row-wise and supply the scalar parameters by using .apply():
mydata["looked up value"] = mydata\
    .apply(lambda row: lookuptable(row["string"], row["faktor"]), axis=1)
You can do this without a function:
import pandas as pd
lookup = {"foo": 1, "bar": 0}
mydata = pd.DataFrame([["foo", 4], ["bar",3]], columns = ["string","factor"])
mydata["looked up value"] = mydata['string'].map(lookup) * mydata['factor']

Numpy get maximum value based on XYZ

I'm trying to read a CSV file with some XYZ data, but gridding it with Python Natgrid is causing an error: two input triples have the same x/y coordinates. Here is my array:
np.array([[41.540588, -100.348335, 0.052785],
          [41.540588, -100.348335, 0.053798],
          [42.540588, -102.348335, 0.021798],
          [42.540588, -102.348335, 0.022798],
          [43.540588, -103.348335, 0.031798]])
I want to remove the XY duplicates and keep the maximum Z value. Based on the example above, I want to drop the lower-Z duplicates and end up with:
np.array([[41.540588, -100.348335, 0.053798],
          [42.540588, -102.348335, 0.022798],
          [43.540588, -103.348335, 0.031798]])
I have tried using np.unique, but so far I haven't had any luck because it doesn't work with rows (only columns).
Here is a numpy way, sorting first by Z (descending), then finding the first index of each unique X, Y pair, and indexing:
a = np.array([[41.540588, -100.348335, 0.052785],
              [41.540588, -100.348335, 0.053798],
              [42.540588, -102.348335, 0.021798],
              [42.540588, -102.348335, 0.022798],
              [43.540588, -103.348335, 0.031798]])
# sort by Z, descending
b = a[np.argsort(a[:, 2])[::-1]]
# get the first index for each unique x, y pair
u = np.unique(b[:, :2], return_index=True, axis=0)[1]
# index
c = b[u]

>>> c
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])
If you are able to use pandas, you can take advantage of groupby and max:
>>> pandas.DataFrame(arr).groupby([0, 1], as_index=False).max().values
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])
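For completeness, a runnable version of that one-liner with the imports and the input array from the question spelled out (note that groupby sorts by the group keys, which here matches the original row order):
import numpy as np
import pandas as pd

arr = np.array([[41.540588, -100.348335, 0.052785],
                [41.540588, -100.348335, 0.053798],
                [42.540588, -102.348335, 0.021798],
                [42.540588, -102.348335, 0.022798],
                [43.540588, -103.348335, 0.031798]])

# group on the X and Y columns (0 and 1), take the max Z per group
print(pd.DataFrame(arr).groupby([0, 1], as_index=False).max().values)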
You can use Pandas via sorting and dropping duplicates:
import pandas as pd

df = pd.DataFrame(arr)
res = df.sort_values(2, ascending=False)\
        .drop_duplicates([0, 1])\
        .sort_values(0).values
print(res)
array([[ 4.15405880e+01, -1.00348335e+02,  5.37980000e-02],
       [ 4.25405880e+01, -1.02348335e+02,  2.27980000e-02],
       [ 4.35405880e+01, -1.03348335e+02,  3.17980000e-02]])

Creating list of set values from a single value containing multiple value sets under one parenthesis

So I currently have a column containing values like this:
d = {'col1': [LINESTRING(174.76028 -36.80417, 174.76041 -36.80389, 175.76232 -36.82345)]}
df = pd.DataFrame(d)
and I am trying to make it so that I can:
1) apply a function to each of the numerical values, and
2) end up with something like this:
d = {'col1': [LINESTRING], 'col2': [(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)]}
df = pd.DataFrame(d)
Any thoughts?
Thanks
Here is one way. Note that LineString accepts an ordered collection of tuples as an input. See the docs for more information.
We use operator.attrgetter to access the required attributes: coords and __class__.__name__.
import pandas as pd
from operator import attrgetter

class LineString():
    def __init__(self, list_of_coords):
        self.coords = list_of_coords

df = pd.DataFrame({'col1': [LineString([(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)])]})
df['col2'] = df['col1'].apply(attrgetter('coords'))
df['col1'] = df['col1'].apply(attrgetter('__class__')).apply(attrgetter('__name__'))
print(df)

         col1                                               col2
0  LineString  [(174.76028, -36.80417), (174.76041, -36.80389...
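If col1 actually holds shapely geometries (which the question's LINESTRING notation suggests), the same idea works against shapely's real API; a minimal sketch, assuming shapely is installed:
import pandas as pd
from shapely.geometry import LineString

line = LineString([(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)])
df = pd.DataFrame({'col1': [line]})

# .coords is a coordinate sequence; list() materializes it as tuples
df['col2'] = df['col1'].apply(lambda g: list(g.coords))
# geom_type is the geometry's type name, e.g. 'LineString'
df['col1'] = df['col1'].apply(lambda g: g.geom_type)
print(df)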

Creating Dataframe with numpy array with index and columns [duplicate]

I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:
data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])
I'd like the resulting DataFrame to have Row1 and Row2 as index values, and Col1, Col2 as header values.
I can specify the index as follows:
df = pd.DataFrame(data, index=data[:,0])
However, I am unsure how to best assign the column headers.
You need to specify data, index and columns to the DataFrame constructor, as in:
>>> pd.DataFrame(data=data[1:,1:],    # values
...              index=data[1:,0],    # 1st column as index
...              columns=data[0,1:])  # 1st row as the column names
Edit: as noted in @joris's comment, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
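Applied to the array from the question, with the dtype conversion folded in (astype(int) here is equivalent to the np.int_ call mentioned above):
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])

df = pd.DataFrame(data=data[1:, 1:].astype(int),  # values, converted from str
                  index=data[1:, 0],              # first column as the index
                  columns=data[0, 1:])            # first row as the column names
print(df)
#       Col1  Col2
# Row1     1     2
# Row2     3     4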
Here is an easy-to-understand solution:
import numpy as np
import pandas as pd

# Creating a 2-dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])

# Creating a pandas dataframe from the numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
   Column1  Column2
0      5.8      2.8
1      6.0      2.2
I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying "option 2" from this great answer, you could do it like this:
import pandas
import numpy
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
df = pandas.DataFrame(values, index=index)
This can be done simply by using pandas' DataFrame.from_records:
import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)
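from_records also accepts index and columns arguments, so the header/index split from the accepted answer can be written with it as well; a sketch on the question's array (the astype(int) conversion is my addition):
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])

# each remaining row becomes one record
df = pd.DataFrame.from_records(data[1:, 1:].astype(int),
                               index=data[1:, 0],
                               columns=data[0, 1:])
print(df)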
>>> import pandas as pd
>>> import numpy as np
>>> data.shape
(480, 193)
>>> type(data)
numpy.ndarray
>>> df = pd.DataFrame(data=data[0:, 0:],
...                   index=[i for i in range(data.shape[0])],
...                   columns=['f'+str(i) for i in range(data.shape[1])])
>>> df.head()
(the df.head() output was a screenshot: the first five rows, with columns f0 … f192)
Here is a simple example of creating a pandas DataFrame from a numpy array:
import numpy as np
import pandas as pd
# create an array
var1 = np.arange(start=1, stop=21, step=1).reshape(-1)
var2 = np.random.rand(20,1).reshape(-1)
print(var1.shape)
print(var2.shape)
dataset = pd.DataFrame()
dataset['col1'] = var1
dataset['col2'] = var2
dataset.head()
Adding to @behzad.nouri's answer, we can create a helper routine to handle this common scenario:
import numpy as np
import pandas as pd

def csvDf(dat, **kwargs):
    # first row -> column names, first column -> index, the rest -> values
    data = np.array(dat)
    if data is None or len(data) == 0 or len(data[0]) == 0:
        return None
    return pd.DataFrame(data[1:, 1:], index=data[1:, 0], columns=data[0, 1:], **kwargs)
Let's try it out:
data = [['', 'a', 'b', 'c'],
        ['row1', 'row1cola', 'row1colb', 'row1colc'],
        ['row2', 'row2cola', 'row2colb', 'row2colc'],
        ['row3', 'row3cola', 'row3colb', 'row3colc']]

In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc
I think this is a simple and intuitive method:
data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
reward = np.array([1, 0, 1, 0])

dataset = pd.DataFrame()
dataset['StateAttributes'] = data.tolist()
dataset['reward'] = reward.tolist()
dataset
which returns:
  StateAttributes  reward
0          [0, 0]       1
1          [0, 1]       0
2          [1, 0]       1
3          [1, 1]       0
But there are performance implications, detailed here:
How to set the value of a pandas column as list
It's not so short, but maybe it can help you.
Creating the array:
import numpy as np
import pandas as pd

data = np.array([['col1', 'col2'], [4.8, 2.8], [7.0, 1.2]])
>>> data
array([['col1', 'col2'],
       ['4.8', '2.8'],
       ['7.0', '1.2']], dtype='<U4')
Creating the data frame:
df = pd.DataFrame(i for i in data).transpose()
df.drop(0, axis=1, inplace=True)
df.columns = data[0]
>>> df
  col1 col2
0  4.8  7.0
1  2.8  1.2
