I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes I would also like to do some filtering upon loading, applying the filter via the query or eval methods. That means I need a query string of the form e.g. "PROBABILITY > 10 and DISTANCE <= 50", and the columns it references need to be loaded in the dataframe.
Is it possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use the usecols parameter when you load your dataframe. I use it when I load a CSV; I don't know whether that is possible when you use SQL or another format.
columns_to_use = ['Column1', 'Column3']
pd.read_csv(..., usecols=columns_to_use)
Thank you
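There is no public pandas API for pulling column names out of a query string, but since query strings are valid Python expressions, the standard-library ast module can collect the identifiers without regexes. A minimal sketch (the file name is an assumption, and it won't handle backtick-quoted column names or @-referenced variables):
import ast
import pandas as pd

def columns_in_query(query: str) -> set:
    """Collect identifier names appearing in a query expression via the ast module."""
    tree = ast.parse(query, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

query = "PROBABILITY > 10 and DISTANCE <= 50"
cols = columns_in_query(query)                      # {'PROBABILITY', 'DISTANCE'}
df = pd.read_csv("data.csv", usecols=list(cols)).query(query)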
Every row of my dataframe contains a record with a unique key combination. The data validation will be based on the columns and on the key combination. For example, in a single column, cells may have a different min/max requirement based on the key combination.
Several questions:
Can Pandera validate on a cell basis, as opposed to a column basis?
Does Pandera have a schema generator capable of this type of flexibility? Perhaps it scans a "golden dataframe" as a starting place to create a schema based on some provided criteria. I realize the schema generator output may need a bit of tweaking.
The library does look cool, and I am interested to pursue further.
thanks
You can create a check that validates a single value at a time with the element_wise=True kwarg; you can read more in the docs.
import pandera as pa
check = pa.Check(lambda x: 0 <= x <= 100, element_wise=True)
The function must take an individual value as input and output a boolean.
Can you elaborate on the exact check that you want to perform? If you want to do a dataframe-level row-wise check you can use an element-wise check at the dataframe-level as a wide check.
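For the key-combination case, one option is a dataframe-level (wide) check that looks up each row's min/max from its key pair. A minimal sketch, with made-up column names (region, metric, value) and made-up limits:
import pandas as pd
import pandera as pa

# Hypothetical per-key (min, max) limits.
LIMITS = {("EU", "temp"): (0.0, 40.0), ("US", "temp"): (-10.0, 50.0)}

def within_key_limits(df: pd.DataFrame) -> pd.Series:
    # Look up each row's (min, max) from its key combination, defaulting to no limit.
    bounds = df[["region", "metric"]].apply(tuple, axis=1).map(
        lambda key: LIMITS.get(key, (float("-inf"), float("inf")))
    )
    return df["value"].between(bounds.str[0], bounds.str[1])

schema = pa.DataFrameSchema(
    columns={"region": pa.Column(str), "metric": pa.Column(str), "value": pa.Column(float)},
    checks=pa.Check(within_key_limits, error="value outside per-key min/max"),
)

validated = schema.validate(
    pd.DataFrame({"region": ["EU", "US"], "metric": ["temp", "temp"], "value": [25.0, 45.0]})
)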
Does Pandera have a schema generator capable of this type of flexibility? Perhaps it scans a "golden dataframe" as a starting place to create a schema based on some provided criteria. I realize the schema generator output may need a bit of tweaking.
You can use the schema = pandera.infer_schema(golden_dataframe) function to bootstrap a starter schema, then write it out to a file with schema.to_script("path/to/file") to further iterate.
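A tiny sketch of that bootstrap flow (the golden dataframe here is made-up data just for illustration):
import pandas as pd
import pandera as pa

# Stand-in for the "golden dataframe".
golden_dataframe = pd.DataFrame({"key": ["a", "b"], "value": [1.0, 2.5]})

# Infer a starter schema and dump it to a script you can then hand-edit.
schema = pa.infer_schema(golden_dataframe)
schema.to_script("inferred_schema.py")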
I'm working with a data set consisting of several CSV files of nearly the same form. Each CSV describes a particular date and labels the data by state/province. However, the format of one of the column headers was altered from Province/State to Province_State, so all CSVs created before a certain date use the first format and all CSVs created after that date use the second.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the CSV data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by e.g. conditioning on a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
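One way to package that up so the rest of the code never sees the two header variants; a sketch with hypothetical file and location names:
import pandas as pd

def load_daily(path: str) -> pd.DataFrame:
    """Load one daily CSV and normalise the state column name, whichever variant it uses."""
    df = pd.read_csv(path)
    area_col = df.filter(regex=r"Province(/|_)State").columns[0]
    return df.rename(columns={area_col: "Province_State"})

daily_data = load_daily("daily_report.csv")
total = daily_data[daily_data["Province_State"] == "Washington"].sum(numeric_only=True)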
I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done; I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # displays all values from Column in the dataframe
# Even more so, doing
df.loc[df['Column'] > 10] # displays all rows where Column is greater than 10
# and the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself, not of plain Python. Once you load your data with pd.read_csv, it is stored as a pandas DataFrame, which is a dictionary-like container of columns; every column is a pandas Series. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']); both return the same Series, so methods like .head(), .tail(), .shape or .isna() work either way (attribute access only works when the column name is a valid Python identifier and doesn't clash with an existing DataFrame attribute). Accessing a column by name is a lookup among the DataFrame's columns, not a loop over every row; if the name isn't found you get a KeyError or an AttributeError, depending on which access style you used.
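A quick illustration of both points: attribute and key access return the same Series, and filtering works through a boolean mask rather than an explicit Python loop:
import pandas as pd

df = pd.DataFrame({"Column": [5, 12, 20]})

# Attribute access and key access resolve to the same column (a pandas Series).
print(df["Column"].equals(df.Column))      # True

# The comparison is vectorised and returns a boolean Series;
# .loc then uses it as a row mask.
mask = df["Column"] > 10
print(df.loc[mask])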
read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to apply the same conversions to a DataFrame whose columns contain string data but which does not come from a CSV file, only from a different (unparsed) source, so that just an in-memory DataFrame object is available?
Saving such a DataFrame to a CSV file and reading it back would do the conversion, but that looks very inefficient to me.
If you have e.g. a column of string type that actually contains dates
(e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamps.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any of your columns
(see the documentation for DataFrame.apply).
You didn't give many details, so I can only offer this very general advice.
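For the general case, one possible sketch is to walk over the object-dtype columns and try pd.to_numeric and pd.to_datetime in turn, keeping a conversion only when every value parses (column names here are made up):
import pandas as pd

df = pd.DataFrame({
    "SomeDate": ["2021-01-01", "2021-02-01"],
    "Amount":   ["1.5", "2.25"],
    "Label":    ["a", "b"],
})

for col in df.select_dtypes(include="object").columns:
    numeric = pd.to_numeric(df[col], errors="coerce")
    if numeric.notna().all():
        df[col] = numeric
        continue
    dates = pd.to_datetime(df[col], errors="coerce")
    if dates.notna().all():
        df[col] = dates

print(df.dtypes)   # Amount becomes float64, SomeDate becomes datetime64[ns], Label stays object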
I am loading a text file into pandas, and have a field that contains year. I want to make sure that this field is a string when pulled into the dataframe.
I can only seem to get this to work if I specify the exact length of the string using the code below:
df = pd.read_table('myfile.tsv', dtype={'year':'S4'})
Is there a way to do this without specifying length? I will need to perform this action on different columns that vary in length.
I believe we enabled this in 0.12.
You can pass str, np.str_, or object in place of the S4;
they all convert to object dtype in any event.
Or, after you read it in:
df['year'] = df['year'].astype(str)
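Putting both options together, a short sketch (file and column name taken from the question):
import pandas as pd

# Keep the year column as strings while reading the file:
df = pd.read_table("myfile.tsv", dtype={"year": str})
print(df["year"].dtype)            # object -- the values stay strings

# Or cast after reading:
df["year"] = df["year"].astype(str)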