I am writing a script which reads values from a CSV through pandas into a DataFrame. The values, 'A' and 'B', are inputs into an equation. This equation is obtained from an XML output file from an external program. The equation provides a result for 'A' and 'B' row-by-row of the DataFrame and places those results back into the original DataFrame.
If I make a function definition, explicitly write the equation in the definition, and return that equation, things work fine, e.g.,
import pandas as pd
dataFrame = pd.read_csv() # Reads CSV to "dataFrame"
A = dataFrame['A'] # Defines A as column 'A' of "dataFrame"
B = dataFrame['B'] # Defines B as column 'B' of "dataFrame"
def Func(a, b):
    P = 2*a + 3*b
    return P
outPut['P'] = Func(A, B) # Assigns a value to each row in "outPut" for each 'A' and 'B' per row of "dataFrame"
However, what I really want to do is "build" that same equation from an XML file rather than entering it explicitly. So I basically pull 'terms' and 'coefficients' from the XML file and end up with a string form of the equation. I then convert the string to an executable function using sympy.sympify(), e.g.,
import pandas as pd
import sympy as sy
import xml.etree.ElementTree as etree
dataFrame = pd.read_csv() # Reads CSV to "dataFrame"
A = dataFrame['A'] # Defines A as column 'A' of "dataFrame"
B = dataFrame['B'] # Defines B as column 'B' of "dataFrame"
tree = etree.parse('C:\...')
.
..(some XML stuff with etree)
.
equationString = "some code that grabs terms and coefficients from XML file" # Builds equation from XML 'terms' and 'coefficients'
P = sy.sympify(equationString)
def Func(A, B):
    global P
    return P
outPut['P'] = Func(A, B) # Assigns a value to each row in "outPut" for each 'A' and 'B' per row of "dataFrame"
The result is that when I execute this equation over the DataFrame, the literal equation is copied into the "outPut" DataFrame rather than the row-by-row result for each 'A' and 'B'. I don't understand why Python sees these two examples differently, nor how to achieve the result of the first example from the second. For some reason the sympify() result is not callable. The same seems to occur when I use eval().
Elaborating on my comment, here's how to solve the problem with lambdify:
In [1]: import sympy as sp
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [5]: equationString = "2*A+3*B"
In [7]: expr = sp.S(equationString)
In [8]: expr
Out[8]: 2*A + 3*B
In [10]: f = sp.lambdify(sp.symbols("A B"), expr, modules="numpy")
In [11]: f(df['A'],df['B'])
Out[11]:
0 -2.779739
1 -1.176580
2 3.911066
3 1.888639
4 0.745293
dtype: float64
In [12]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[12]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
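To close the loop on the original goal, the lambdified function's result can be assigned straight into an output DataFrame column (a minimal sketch; outPut here is a stand-in for the asker's output frame):

outPut = pd.DataFrame(index=df.index)
outPut['P'] = f(df['A'], df['B'])  # row-by-row results, not the literal expression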
Depending on the expressions encountered in your XML file, sympy may be overkill. Here's how to use eval:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
In [4]: equationString = "2*A+3*B"
In [5]: f = eval("lambda A, B: "+equationString)
In [6]: f(df['A'],df['B'])
Out[6]:
0 1.094797
1 -1.942295
2 -5.181502
3 1.888990
4 3.069017
dtype: float64
In [7]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[7]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
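One caveat: eval will execute arbitrary code, so if the XML file is not fully trusted, a common (though not bulletproof) mitigation is to strip the builtins from the evaluation namespace. A sketch:

equationString = "2*A+3*B"
# With empty builtins, the string can only reference the lambda's arguments.
f = eval("lambda A, B: " + equationString, {"__builtins__": {}})
f(2, 3)  # 13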
I want to change each string's suffix to be a prefix in a column of a DataFrame, which is made with the following code for example:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
a b
1 100000.ss 10
2 200000.zz 18
I tried the one-line code below, but the result shows the if/else doesn't work. Why?
df['a'] = df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") else 'zz.'+x[:6])
a b
1 ss.100000 10
2 ss.200000 18
Each x in your lambda function is a string. x.find returns -1 if the substring is not found, and -1 is truthy, so your lambda always returns 'ss.' + .... Try changing your lambda to this:
df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") != -1 else 'zz.'+x[:6])
Out[4]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Anyway, you don't need apply for this; just use the pandas str accessor:
df['a'].str[-2:] + '.' + df['a'].str[:-3]
Out[10]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Why do the hard work when there is a library that does it for you?
import pandas as pd
from pathlib import Path
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df.assign(
    a=lambda x: x["a"].apply(lambda s: f"{Path(s).suffix[1:]}.{Path(s).stem}")
)
Output:
           a   b
1  ss.100000  10
2  zz.200000  18
There might be ways to do this in fewer lines (one is sketched after the code), but here is a solution:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df[['First', 'Last']] = df.a.str.split(".", expand=True)
df['a'] = df['Last'] + '.' + df['First']
df = df.drop(['First', 'Last'], axis=1)  # drop the helper columns
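The same split/recombine idea also fits in one line; a sketch (expand=True yields a two-column frame whose halves can be recombined directly):

df['a'] = df.a.str.split(".", expand=True).pipe(lambda p: p[1] + '.' + p[0])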
I recently posted on how to create multiple variables from a CSV file. The code worked in that the variables were created. However, the code creates a bunch of variables all equal to the first row. I need the code to make one variable for each row in the DataFrame.
I need 208000 variables labeled A1:A20800
The code I currently have:
import pandas

df = pandas.read_csv(file_name)
for i in range(1, 207999):
    for c in df:
        exec("%s = %s" % ('A' + str(i), c))
        i += 1
I have tried adding additional quotation marks around the second %s (which gives a syntax error). I have tried selecting all the rows of the df and using that. I'm not sure why it isn't working! Every time I print a variable to test whether it worked, it prints the same value (i.e., A1 = A2 = A3 = ... = A207999). What I actually want is:
A1 = row 1
A2 = row 2
.
.
.
Thank you in advance for any assistance!
I don't know how pandas reads a file, but I'm guessing it returns an iterable. One caveat: iterating a DataFrame directly yields column labels, not rows, so iterate over df.itertuples() instead; islice then allows just 20800 rows to be read:
from itertools import islice
import pandas

df = pandas.read_csv(file_name)
A = list(islice(df.itertuples(index=False), 20800))
# now access rows: A[index]
If you want to create a list containing the values of each row from your DataFrame, you can use the method df.iterrows():
[row[1].to_list() for row in df.iterrows()]
If you still want to create a large number of variables, you can do so in a loop as:
for row in df.iterrows():
    list_with_row_values = row[1].to_list()  # row[0] is the index, row[1] holds the values
    # create your variables here...
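That said, instead of materializing thousands of separate variables, a dict keyed by the desired names is usually easier to work with. A sketch of that alternative:

# Build A1, A2, ... as dict keys rather than exec-created variables.
rows = {'A' + str(i + 1): row.to_list() for i, (idx, row) in enumerate(df.iterrows())}
rows['A1']  # the values of the first row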
You are getting the same value for all the variables because you increment i in your inner for loop while the outer loop resets it, so the same Annnn variables keep being reassigned and are probably all set to the last value.
So you want something more like:
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[42, 42, 42]})
In [3]: df
Out[3]:
a b
0 1 42
1 2 42
2 3 42
In [27]: i = 1
In [28]: for c in df.iterrows():
    ...:     exec("%s = c" % ('A' + str(i)))
    ...:     i += 1
...:
In [29]: A1
Out[29]:
(0L, a 1
b 42
Name: 0, dtype: int64)
In [30]: A1[0]
Out[30]: 0L
In [32]: A1[1]
Out[32]:
a 1
b 42
Name: 0, dtype: int64
Is there a way to replace a masked value in a numpy masked array with a null or None value? This is what I have tried, but it does not work.
for stars in range(length_masterlist_final):
    ....
    star = customSimbad.query_object(star_names[stars])
    # obtain stellar info.
    photometry_dataframe.iloc[stars, 0] = star_IDs[stars]
    photometry_dataframe.iloc[stars, 1] = star_names[stars]
    photometry_dataframe.iloc[stars, 2] = star['FLUX_U'][0]
    # Replace "--" masked values with a null (i.e., '') value.
    photometry_dataframe.iloc[stars, 2] = ma.filled(photometry_dataframe.iloc[stars, 2], fill_value=None)
    .....
photometry_dataframe.to_csv(output_dir + "simbad_photometry.csv", index=False, header=True, na_rep='NaN')
Specifically,
(photometry_dataframe.iloc[stars,2] = ma.filled(photometry_dataframe.iloc[stars,2], fill_value=None))
produces
'MaskedConstant' object has no attribute '_fill_value'
I want to replace masked values '--' with '' when I output the DataFrame as a csv file. One workaround is to read the output csv file back into Python and replace '--' with '', but that is a horrible solution. There must be a better way. I don't want masked values printed as '--' in the csv file.
Use Astropy:
>>> from pandas import DataFrame
>>> from astropy.table import Table
>>> import numpy as np
>>>
>>> df = DataFrame()
>>> df['a'] = [1, np.nan, 2]
>>> df['b'] = [3, 4, np.nan]
>>> df
a b
0 1 3
1 NaN 4
2 2 NaN
>>> t = Table.from_pandas(df)
>>> t
<Table masked=True length=3>
a b
float64 float64
------- -------
1.0 3.0
-- 4.0
2.0 --
>>> t.write('photometry.csv', format='ascii.csv')
>>>
(astropy)neptune$ cat photometry.csv
a,b
1.0,3.0
,4.0
2.0,
You can specify arbitrary transformations from table values to output values using the fill_values parameter (http://docs.astropy.org/en/stable/io/ascii/write.html#parameters-for-write).
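For example, to write masked entries as an explicit marker instead of an empty field, something along these lines should work (a sketch based on the linked docs; 'NULL' is an arbitrary choice):

from astropy.io import ascii
# Replace masked entries with 'NULL' in the written file.
t.write('photometry.csv', format='ascii.csv', fill_values=[(ascii.masked, 'NULL')])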
My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which grabs the underlying numpy array, sets it to an immutable state, and hashes the buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See comments for the Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But then I try to apply it to a DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a DataFrame across application launches in order to retrieve some value from a cache.
As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the dataframe (and works on series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
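Since hash_pandas_object does not rely on Python's run-randomized hash(), it can also give the launch-stable digest the asker wanted. A minimal sketch:

import hashlib
from pandas.util import hash_pandas_object

def df_digest(df):
    # uint64 row hashes -> raw bytes -> hex digest; stable across runs.
    return hashlib.sha256(hash_pandas_object(df, index=True).values.tobytes()).hexdigest()

df_digest(df)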
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
I had a similar problem: checking whether a dataframe has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data.
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
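Note that DataFrame.to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so on modern pandas the same serialize-then-hash idea needs a different stable serialization; a sketch using to_json under that assumption:

import hashlib
import pandas as pd

DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
# to_json() gives a deterministic text serialization for the same data.
assert hashlib.md5(data1.to_json().encode()).hexdigest() == hashlib.md5(data2.to_json().encode()).hexdigest()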
This function seems to work fine:
from hashlib import sha256

def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
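One caveat with this approach: str(df.values) is numpy's repr, which is truncated with an ellipsis for large arrays, so two large frames differing only in the middle rows could collide. Hashing a full serialization avoids that; a sketch using to_csv:

from hashlib import sha256

def hash_df_full(df):
    # to_csv() serializes columns, index, and every value, with no repr truncation.
    return sha256(df.to_csv().encode()).hexdigest()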
In Python, I am trying to find the quickest way to hash each value in a pandas data frame.
I know any string can be hashed using:
hash('a string')
But how do I apply this function on each element of a pandas data frame?
This may be a very simple thing to do, but I have just started using python.
Pass the hash function to apply on the str column:
In [37]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df
Out[37]:
a
0 asds
1 asdds
2 asdsadsdas
In [39]:
df['hash'] = df['a'].apply(hash)
df
Out[39]:
a hash
0 asds 4065519673257264805
1 asdds -2144933431774646974
2 asdsadsdas -3091042543719078458
If you want to do this to every element then call applymap:
In [42]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas'],'b':['asewer','werwer','tyutyuty']})
df
Out[42]:
a b
0 asds asewer
1 asdds werwer
2 asdsadsdas tyutyuty
In [43]:
df.applymap(hash)
Out[43]:
a b
0 4065519673257264805 7631381377676870653
1 -2144933431774646974 -6124472830212927118
2 -3091042543719078458 -1784823178011532358
Pandas also has a function to apply a hash function on an array or column:
import pandas as pd
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df["hash"] = pd.util.hash_array(df["a"].to_numpy())
In addition to @EdChum's answer, a heads-up: hash() does not return the same value for a given string across runs or machines, since Python randomizes string hashing. Depending on your use case, you are better off with
import hashlib
def md5hash(s: str):
    return hashlib.md5(s.encode('utf-8')).hexdigest()  # or SHA, ...
df['a'].apply(md5hash)
# or
df.applymap(md5hash)