Mutable indexed heterogeneous data structure? - python

Is there a data class or type in Python that matches these criteria?
I am trying to build an object that looks something like this:
ExperimentData
ID 1
sample_info_1: character string
sample_info_2: character string
Dataframe_1: pandas data frame
Dataframe_2: pandas data frame
ID 2
(etc.)
Right now, I am using a dict to hold the object ('ExperimentData'), which contains a namedtuple for each ID. Each namedtuple has a named field for the corresponding data attached to the sample. This lets me keep all the IDs indexed, and all of the fields under each ID indexed as well.
However, I need to update and/or replace the entries under each ID during downstream analysis. Since a tuple is immutable, this does not seem to be possible.
Is there a better implementation of this?

You could use a dict of dicts instead of a dict of namedtuples. Dicts are mutable, so you'll be able to modify the inner dicts.
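For example, a minimal sketch of the dict-of-dicts layout (the sample values and field names here are purely illustrative):
import pandas as pd

experiment_data = {
    1: {
        'sample_info_1': 'treated',
        'sample_info_2': 'batch A',
        'dataframe_1': pd.DataFrame({'x': [1, 2]}),
        'dataframe_2': pd.DataFrame({'y': [3, 4]}),
    },
}

# Inner dicts are mutable, so entries can be updated or replaced downstream:
experiment_data[1]['sample_info_1'] = 'control'
experiment_data[1]['dataframe_1'] = experiment_data[1]['dataframe_1'].assign(x_squared=lambda d: d['x'] ** 2)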
Given what you said in the comments about the structures of each DataFrame-1 and -2 being comparable, you could also group all of each into one big DataFrame, by adding a column to each DataFrame containing the value of sample_info_1 repeated across all rows, and likewise for sample_info_2. Then you could concat all the DataFrame-1s into a big one, and likewise for the DataFrame-2s, getting all your data into two DataFrames. (Depending on the structure of those DataFrames, you could even join them into one.)
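A rough sketch of that consolidation, assuming the dict-of-dicts layout above and compatible columns across samples:
import pandas as pd

frames = []
for sample_id, entry in experiment_data.items():
    df = entry['dataframe_1'].copy()
    df['sample_id'] = sample_id
    df['sample_info_1'] = entry['sample_info_1']
    df['sample_info_2'] = entry['sample_info_2']
    frames.append(df)

all_dataframe_1 = pd.concat(frames, ignore_index=True)
# ...and the same pattern for the Dataframe_2 entries.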

Related

Extracting data from a dict like column in a dataframe

I have a column, info, in a dataframe with data in a dict-like format like the one below.
I would like to get another dataframe from this info, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns None.
How can I do it? I am really having a bad time trying to get this done.
As the dict is nested, you can try pd.json_normalize() which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
Since some of the inner dicts sit inside lists, you may need further handling to dig out the deeper contents. Anyway, this should serve as a good starting point for your work.
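For instance, with a made-up nested column (the data here is hypothetical, just to show the flattening):
import pandas as pd

df = pd.DataFrame({'info': [
    {'Feature': 'A', 'details': {'score': 1}},
    {'Feature': 'B', 'details': {'score': 2}},
]})

df_new = pd.json_normalize(df['info'])
print(df_new)
#   Feature  details.score
# 0       A              1
# 1       B              2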

fastest way to copy values from one cell of a dataframe to another data frame if a third cell matches

I have a master dataframe with anywhere between 750 and 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code of the daily order dataframe is found in the master dataframe, I get the item cost. Otherwise, it is marked as invalid and deleted.
I currently do this via two for loops, but I will have to do many more such comparisons and data updates (other fields to compare, other values to copy).
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python codes and I want to learn the most efficient way to do this.
Order DF:
[image: Order csv from which the Order DF is created]
Master DF:
[image: Master csv from which the Master DF is created]
def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    # paramC: name of the column to compare; paramF: names of the columns to fill/use
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # mtrl cbf
                found = True
                break
        if not found:
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound
I am passing along the column names I am comparing and the column names I am filling with data because there are minor variations in naming across the CSVs. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be used as the index because no single column contains unique data; it is always a combination of values that makes a row unique.
The data, in this case, is a float and NumPy could be used, but in other cases, like copying city names from a master, the data is a string. NumPy was the suggestion given to other people with a similar issue.
I don't know if this is the most efficient way of doing it. As someone who started programming with Fortran and then C, I always favour basic datatypes, and this solution does not use them. It is, however, a very Pythonic solution.
orderDF = orderDF[orderDF[paramC].isin(mstrDF[paramC])]
orderDF = orderDF.reset_index(drop=True)
I then use a left merge of orderDF onto mstrDF to copy all the relevant values:
orderDF = orderDF.merge(mstrDF.drop_duplicates(paramC, keep='last')[[paramC, paramF[0]]], on=paramC, how='left', validate='m:1')
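For reference, a small self-contained demo of this filter-then-merge pattern, using made-up column names ('prodCode', 'cbf', 'qty') in place of the real ones:
import pandas as pd

master = pd.DataFrame({'prodCode': ['A', 'B'], 'cbf': [0.5, 1.2]})
orders = pd.DataFrame({'prodCode': ['A', 'B', 'C'], 'qty': [10, 5, 7]})

# Drop order rows whose product code is missing from the master
orders = orders[orders['prodCode'].isin(master['prodCode'])].reset_index(drop=True)

# Copy the item volume across with a left merge (many order rows per master row)
orders = orders.merge(master.drop_duplicates('prodCode', keep='last')[['prodCode', 'cbf']],
                      on='prodCode', how='left', validate='m:1')
orders['ttlVol'] = orders['cbf'] * orders['qty']
print(orders)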

Size immutability in pandas data structure

While going through pandas Documentation for version 0.24.1 here, I came across this statement.
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
import pandas as pd
test_s = pd.Series([1,2,3])
id(test_s) # output: 140485359734400 (will vary)
len(test_s) # output: 3
test_s[3] = 37
id(test_s) # output: 140485359734400
len(test_s) # output: 4
My inference was that size-immutable means operations like appending or deleting an element are not allowed, which is clearly not the case here. Even the identity of the object remains the same, ruling out the possibility that a new object was created and bound to the same name.
So, what does size immutability actually mean?
Appending and deleting are allowed, but that doesn't necessarily imply the underlying storage is size-mutable.
Series/DataFrames are internally represented by NumPy arrays which are immutable (fixed size) to allow a more compact memory representation and better performance.
When you assign to a label that doesn't exist yet, Series.__setitem__ ends up allocating a new, larger underlying array and binding it back to the same Series object (of course, as the end user, you don't get to see this), giving you the illusion of size mutability.
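One way to see this (a sketch; for plain NumPy dtypes, .values exposes the underlying array without copying):
import pandas as pd

s = pd.Series([1, 2, 3])
arr_before = s.values              # the underlying NumPy array

s[3] = 37                          # 'growing' the Series by assigning to a new label

arr_after = s.values
print(arr_before is arr_after)     # False: a new, larger array was allocated
print(len(arr_before), len(arr_after))  # 3 4 -- the old array itself was never resized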
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.
As per my understanding, it is already stated that all pandas data structures (Series, DataFrames) are value-mutable, i.e. the values they contain can be altered.
"The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
As this statement says, a Series is essentially a single column: we cannot add new columns to it, nor delete that one column. In a DataFrame, however, we can add and delete columns. So the size that can change here is the number of columns, not the length of any individual column.

PySpark: Exhaustive list of data types

I am trying to define a function in PySpark that can tell me which columns should be considered numeric (continuous) and which should be considered categorical. While doing this, I'm accessing the dtypes of the dataframe and iterating through each of the variables to check whether it is a member of continuous_types or categorical_types (defined below). These are their entries:
continuous_types = ('double', 'bigint')
categorical_types = ('string',)  # note the trailing comma; ('string') without it is just a string
I think there are more type strings that should be part of both of these, especially continuous_types. I got these by creating and reading datasets and checking their dtypes. Are these three exhaustive?
I looked up this link but I couldn't get the required information.
In short, what is the exhaustive list of values I can expect when I access the dtypes attribute of a Spark DataFrame?
You can find the available types here:
pyspark.sql.types
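A sketch of how you might avoid maintaining a hand-written list at all, by checking the schema's type objects instead of dtype strings (the column names here are made up):
from pyspark.sql import SparkSession
from pyspark.sql.types import NumericType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.5, 'a')], ['i', 'd', 's'])

# df.dtypes gives (name, type-string) pairs, e.g. [('i', 'bigint'), ('d', 'double'), ('s', 'string')]
print(df.dtypes)

# Checking the type objects covers all numeric types (int, bigint, double, decimal, ...) at once
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
categorical_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
print(numeric_cols, categorical_cols)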

Numpy genfromtxt Column Names

How can I get genfromtxt to return the list of column names that were automatically retrieved with names=True? When I do:
data = np.genfromtxt("test.csv",names=True,delimiter=",",dtype=None)
print(data['col1'])
it prints the entire column values for col1.
However, I need to traverse all column names. How can I do that?
I tried data.keys() and various other methods, but whatever is returned by genfromtxt does not seem to be a dictionary compatible object. I guess I could pass the list of column names myself, but this won't be maintainable for me in the long run.
Any ideas?
genfromtxt returns a numpy.ndarray (a structured array when names=True is used).
You can get the data type with
data.dtype
or just the names with
data.dtype.names
which is a tuple you can iterate over to access the columns as you want.
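For example (using an in-memory CSV in place of test.csv, just for illustration):
import numpy as np
from io import StringIO

csv = StringIO('col1,col2,col3\n1,2.5,7\n3,4.5,8\n')
data = np.genfromtxt(csv, names=True, delimiter=',', dtype=None, encoding='utf-8')

for name in data.dtype.names:      # ('col1', 'col2', 'col3')
    print(name, data[name])        # each column name and its values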
