Numpy genfromtxt Column Names - python

How can I get genfromtxt to return the list of column names it automatically retrieves with names=True? When I do:
data = np.genfromtxt("test.csv", names=True, delimiter=",", dtype=None)
print(data['col1'])
it prints all of the values in column col1.
However, I need to traverse all of the column names. How can I do that?
I tried data.keys() and various other methods, but whatever genfromtxt returns does not seem to be a dictionary-compatible object. I could pass the list of column names myself, but that won't be maintainable for me in the long run.
Any ideas?

genfromtxt returns a numpy.ndarray (a structured array when names=True).
You can get the data type with
data.dtype
or just the names with
data.dtype.names
which is a tuple that you can iterate over to access each column.
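For example, a minimal sketch, assuming test.csv has a header row such as col1,col2,col3:

import numpy as np

# names=True turns the header row into field names of a structured array
data = np.genfromtxt("test.csv", names=True, delimiter=",", dtype=None, encoding="utf-8")

for name in data.dtype.names:   # e.g. ('col1', 'col2', 'col3')
    print(name, data[name])     # each field name indexes one full column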

Related

How do I un-nest multiple sets of data and set them as columns

I am trying to bring this data from an API into a data frame, but some of it is nested under other keys. For example, I would like each stat such as 'passing_att' to be its own column, but it is nested under 'passing', and 'passing', 'rushing', and 'receiving' are all nested under 'stats'. I can bring just 'passing' into a data frame, which shows 'passing_att' and the rest of the 'passing' data, but then I don't have names or any of the other data I would like.
The correct term is to 'flatten' a JSON object. There are many ways to do this: you could use the flatten_json pip package or write your own function. For your specific use case, though, the json_normalize() function from the pandas library fits best, because you are trying to flatten a list of JSON objects that can easily be converted into a pandas dataframe.
Once you have the nested JSON object as shown below (performed in your code):
json_list = json.loads(response.text)
Use the following code to flatten it:
df = pandas.json_normalize(json_list)
You will get all the nested columns flattened into a single-level dataframe, with names separated by a period. For example, the passing_att attribute can be accessed via the column name 'stats.passing.passing_att'. You can change the separator by passing the sep parameter to the json_normalize function. For example, to use a hyphen as the separator:
df = pandas.json_normalize(json_list, sep='-')
If you want to select only a few columns, pass a list of column names to the dataframe's [] operator. For example, to select only the player_name, position, passing_att, and passing_cmp columns:
df = df[['player_name', 'position', 'stats.passing.passing_att', 'stats.passing.passing_cmp']]
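Putting it together, a minimal end-to-end sketch; the sample payload below is made up to mirror the nesting described in the question:

import json
import pandas as pd

# Made-up API response shaped like the question's data
response_text = '''[
    {"player_name": "A. Example", "position": "QB",
     "stats": {"passing": {"passing_att": 30, "passing_cmp": 21},
               "rushing": {"rushing_att": 4}}}
]'''

json_list = json.loads(response_text)
df = pd.json_normalize(json_list)  # nested keys become 'stats.passing.passing_att', ...
df = df[['player_name', 'position',
         'stats.passing.passing_att', 'stats.passing.passing_cmp']]
print(df)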

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe holds a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to a DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way, without a loop, with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, though I think it was a dictionary before it was converted to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the json.loads(x) key/index to extract the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
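If the 'details' column already holds parsed dicts rather than JSON strings (as the question suggests), a variation on the same idea would skip json.loads and index into the dicts directly; a sketch assuming the nesting from the question:

# 'details' already contains dicts, so index through 'actions' directly
df['affectedDealId'] = df['details'].apply(lambda d: d['actions']['affectedDealId'])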
I think it would be great if you could do something like this.
This creates a data frame from your JSON column by applying pd.Series:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your data frame by concatenating data_frame_new with your existing data frame:
df = pd.concat([df,data_frame_new],axis = 1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own, with the data populated.
It may be of help to you.
Thanks
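Putting the pieces together, a runnable sketch of that approach on made-up data shaped like the question's; since 'actions' is itself a dict, it needs a second round of pd.Series:

import pandas as pd

# Made-up frame shaped like the question: 'details' holds nested dicts
df = pd.DataFrame({
    'id': [1, 2],
    'details': [{'actions': {'affectedDealId': 'D-100'}},
                {'actions': {'affectedDealId': 'D-200'}}],
})

# Expand 'details' into columns, then expand the inner 'actions' dicts
data_frame_new = df['details'].apply(pd.Series)['actions'].apply(pd.Series)
df = pd.concat([df, data_frame_new], axis=1)
print(df['affectedDealId'])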

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done; I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # displays all values from Column in the dataframe
# Even more so, doing
df.loc[df['Column'] > 10] # displays all rows where Column is greater than 10
# and it is the same with
df.loc[df.Column > 10]
So columns are both attributes and keys; is a DataFrame both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about. And does accessing a column loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data using pd.read_csv, the dataset is stored as a pandas dataframe, a dictionary-like container, and every column of the dataframe is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and methods like .head(), .tail(), .shape, or .isna() work either way. Accessing a column does not loop over the whole dataset: pandas looks the name up in the dataframe's column index and returns the stored Series. If the name is not found, it raises a KeyError or an AttributeError, depending on how you accessed it.
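The one-line filtering itself works through operator overloading: comparing a Series to a scalar returns a boolean Series, and .loc uses that Series as a row mask. A toy sketch:

import pandas as pd

df = pd.DataFrame({'Column': [5, 12, 7, 20]})   # toy data

mask = df['Column'] > 10      # Series.__gt__ returns a boolean Series
print(mask.tolist())          # [False, True, False, True]

print(df.loc[mask])           # .loc keeps only the rows where mask is True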

Mutable indexed heterogeneous data structure?

Is there a data class or type in Python that matches these criteria?
I am trying to build an object that looks something like this:
ExperimentData
    ID 1
        sample_info_1: character string
        sample_info_2: character string
        Dataframe_1: pandas data frame
        Dataframe_2: pandas data frame
    ID 2
    (etc.)
Right now, I am using a dict to hold the object ('ExperimentData'), which contains a namedtuple for each ID. Each namedtuple has a named field for the corresponding data attached to the sample. This keeps all the IDs indexed, and all of the fields under each ID indexed as well.
However, I need to update and/or replace the entries under each ID during downstream analysis. Since a tuple is immutable, this does not seem to be possible.
Is there a better implementation of this?
You could use a dict of dicts instead of a dict of namedtuples. Dicts are mutable, so you'll be able to modify the inner dicts.
Given what you said in the comments about the structures of each DataFrame-1 and -2 being comparable, you could also group all of each into one big DataFrame, by adding a column to each DataFrame containing the value of sample_info_1 repeated across all rows, and likewise for sample_info_2. Then you could concat all the DataFrame-1s into a big one, and likewise for the DataFrame-2s, getting all your data into two DataFrames. (Depending on the structure of those DataFrames, you could even join them into one.)
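A minimal sketch of both suggestions, with made-up values and the field names from the question:

import pandas as pd

# Dict of dicts: the inner dicts are mutable, unlike namedtuples
experiment_data = {
    1: {
        'sample_info_1': 'treated',     # made-up sample metadata
        'sample_info_2': 'batch A',
        'Dataframe_1': pd.DataFrame({'x': [1, 2]}),
        'Dataframe_2': pd.DataFrame({'y': [3, 4]}),
    },
}

# Entries can be updated or replaced during downstream analysis
experiment_data[1]['Dataframe_1'] = pd.DataFrame({'x': [5, 6]})

# The concat alternative: tag each Dataframe_1 with its ID and sample
# info, then stack them all into one big DataFrame
frames = []
for exp_id, record in experiment_data.items():
    d = record['Dataframe_1'].copy()
    d['ID'] = exp_id
    d['sample_info_1'] = record['sample_info_1']
    frames.append(d)
big_df = pd.concat(frames, ignore_index=True)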

What effect does changing the datatype in a pandas dataframe or series have?

Specifically,
If I don't need to change the datatype, is it better left alone? Does it copy the whole column of a dataframe? Does it copy the whole dataframe? Or does it just alter some setting in the dataframe to treat the entries in that column as a particular type?
Also, is there a way to set the type of the columns while the dataframe is getting created?
Here is one example: "2014-05-25 12:14:01.929000" is cast as np.datetime64 when the dataframe is created. Then I save the dataframe to a csv. When I read it back from the csv, the column becomes an arbitrary object. How would I avoid this? Or how can I re-cast this particular column as np.datetime64 while doing pd.read_csv ...
Thanks.
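For reference, a minimal sketch of the usual approaches ('data.csv' and 'timestamp' are hypothetical names, not from the question):

import pandas as pd

# Parse the column as datetime64 while reading, instead of re-casting later
df = pd.read_csv('data.csv', parse_dates=['timestamp'])

# Or re-cast an object column after the fact
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(df.dtypes)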
