Extract value from single row of pandas DataFrame - python

I have a dataset in a relational database format (linked by ID's over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The last line bothers me: it seems ugly, but I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.

If you want just the value and not a DataFrame/Series, call .values and index the first element with [0], so:
price = purchase_group['Column_name'].values[0]
will work.

If purchase_group has a single row, then purchase_group = purchase_group.squeeze() turns it into a Series, so you can simply call purchase_group['Column_name'] to get your value.

Late to the party here, but purchase_group['Column Name'].item() is now available and is cleaner than some of the other solutions.
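For completeness, a minimal runnable sketch comparing the three idioms side by side (the data and the user_id value here are made up to mirror the question):
import pandas as pd

# hypothetical data mirroring the question: one row per user_id
df = pd.DataFrame({'user_id': [1, 2], 'Column_name': [9.99, 4.50]})
purchase_groups = df.groupby('user_id')
purchase_group = purchase_groups.get_group(1)  # a single-row DataFrame

price = purchase_group['Column_name'].values[0]  # via the underlying array
price = purchase_group.squeeze()['Column_name']  # squeeze the row into a Series
price = purchase_group['Column_name'].item()     # scalar from a one-element Series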

This method is also intuitive; for example, to get the first row of values (one list from a list of lists) out of the dataframe:
np.array(df)[0]

Related

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using the at() function
    # in the 'name' column and row i
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows in df_everything whose name equals 'sel'
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
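A minimal sketch of that filter in context (the suggestion values here are made up; only the name and suggestion column names come from the question):
import pandas as pd

df_selection_names = pd.DataFrame({'name': ['fatty red meat, like prime rib', 'grilled']})
df_everything = pd.DataFrame({
    'name': ['grilled', 'steamed', 'fatty red meat, like prime rib'],
    'suggestion': ['charcoal', 'bamboo basket', 'horseradish'],
})

matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]
for _, r in matches.iterrows():
    print(f"{r['suggestion']} is suggested for {r['name']}")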

Get unique years from a date column in pandas DataFrame

I have a date column in my DataFrame say df_dob and it looks like -
id     DOB
23312  31-12-9999
1482   31-12-9999
807    #VALUE!
2201   06-12-1925
653    01/01/1855
108    01/01/1855
768    1967-02-20
What I want to print is a list of unique years, like ['9999', '1925', '1855', '1967'].
Through this list I just want to check whether any unwanted year is present.
I have tried the code pasted below, but I get ValueError: time data 01/01/1855 doesn't match format specified and could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S - when I print df_dob['DOB'], I get values like - 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this, on the year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time, and you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # no need to pass a format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (pandas datetimes cannot represent the year 9999), another method you can try is not converting to datetime at all, but rather finding all the four-digit numbers in the column using a regex.
So:
df['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is there because unique() is a method of a Series, not a DataFrame, so we take the first (and only) column of the extracted result.
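A quick sketch of that regex route on the sample column from the question (the #VALUE! row simply produces no match and is dropped):
import pandas as pd

dob = pd.Series(['31-12-9999', '31-12-9999', '#VALUE!', '06-12-1925',
                 '01/01/1855', '01/01/1855', '1967-02-20'], name='DOB')

years = dob.str.extract(r'(\d{4})')[0].dropna().unique()
print(years)  # ['9999' '1925' '1855' '1967']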
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info().
If the result says something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it as a datetime. You have a couple of different formats, so that might be part of your problem; also, because there are several ways of doing this and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, as you wanted the final version to be.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note: you can't use [] instead of list(), as you'd end up with a list whose single element is a set.
That's a list, as you indicated. (If you wanted it as a column instead, you wouldn't get unique values.)
Nitish's answer will also work, but gives you something like array([9999, 1925, 1855, 1967]).

Pandas, accessing every nth element in nested array

I have a dataframe of many rows and 4 columns. Each column contains an array of 100 values.
My intuitive way of doing this is the same way I would do it with multi-dimensional numpy arrays.
For example, I want the first element of every array in column1. So I say
df["column1"][:][0]
To me this makes sense: first select the column, then take every array, then take the first element of every array.
However, it just doesn't work at all. Instead, it simply spits out the entire array from column1, row 1.
But - and this is the most frustrating thing - if I say:
df["column1"][1][0]
It gives me EXACTLY what I expect based on my expected logic, as in, I get the first element in the array in the second row in column1.
How can I get every nth element in every array in column1?
The reason that df["column1"][:][0] isn't doing what you expect is that df["column1"][:] returns a Series. With a Series, using bracket indexing returns the item of the series at that index.
If you want a series where each item is the element of the corresponding array at that index, the correct solution - whether it seems intuitive or not - is to use .str[...] on the Series.
Instead of
df["column1"][:][0]
use this:
df["column1"].str[0]
It might seem like .str should only be used for actual str values, but a neat trick is that it works for lists too.
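A toy sketch of the trick (the column name mirrors the question; the lists are made up):
import pandas as pd

df = pd.DataFrame({'column1': [[10, 11, 12], [20, 21, 22]]})

print(df['column1'].str[0])  # first element of every list: 10, 20
print(df['column1'].str[2])  # the nth element, here n=2: 12, 22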
Here are some ways to do this:
[item[0] for item in df['column1']] # will result in a list
or
df['column1'].apply(lambda item: item[0]) # will result in a series
Not sure if you're looking for a way that's similar to slicing, but as far as I understand, pandas treats the lists in your table as arbitrary objects, not something it provides syntactic sugar for.
Of course, you can do other fancy things by creating a data frame out of your column:
pd.DataFrame(df['column1'].tolist())
And then do whatever you want with it.
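For instance, a small sketch under the assumption that every cell holds a list of the same length:
import pandas as pd

df = pd.DataFrame({'column1': [[10, 11, 12], [20, 21, 22]]})

expanded = pd.DataFrame(df['column1'].tolist(), index=df.index)
print(expanded[0])          # first element of every array
print(expanded.iloc[:, 2])  # every nth element, here n=2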

For loop does not stop

for lat, lng, value in zip(location_saopaulo_df['geolocation_lat'],
                           location_saopaulo_df['geolocation_lng'],
                           location_saopaulo_df['municipality']):
    coordinates = (lat, lng)
    items = rg.search(coordinates)
    value = items[0]['admin2']
I am trying to iterate over three columns of the dataframe, get the latitude and longitude values from the first two, use them to look up the address, and then add the city name to the last column I mentioned, which is an empty column consisting of NaN values.
However, my for loop does not stop. I would be grateful if you could tell me why it doesn't stop, or a better way to do what I'm trying to do.
Thank you in advance.
If rg is reverse_geocoder, there is a better way to query several coordinates at once than looping. Try this:
res = rg.search(tuple(zip(location_saopaulo_df['geolocation_lat'],
location_saopaulo_df['geolocation_lng'])))
Then extract just the admin2 value by constructing a dataframe, for example:
df_ = pd.DataFrame(res)
and see what it looks like. You may be able to perform a merge or index alignment to put it back into your original dataframe location_saopaulo_df.
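A rough end-to-end sketch, assuming rg is the reverse_geocoder package and that rg.search returns one result dict per coordinate in input order (location_saopaulo_df is the asker's dataframe):
import pandas as pd
import reverse_geocoder as rg

coords = tuple(zip(location_saopaulo_df['geolocation_lat'],
                   location_saopaulo_df['geolocation_lng']))
res = rg.search(coords)  # one result dict per coordinate, in input order

df_ = pd.DataFrame(res)
# positional alignment: row i of df_ corresponds to row i of the input
location_saopaulo_df['municipality'] = df_['admin2'].values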

python, dictionary in a data frame, sorting

I have a python data frame called wiki, with the wikipedia information for some people.
Each row is a different person, and the columns are 'name', 'text' and 'word_count'. The information in 'text' has been put into dictionary form (keys and values) to create the information in the 'word_count' column.
If I want to extract the row related to Barack Obama, then:
row = wiki[wiki['name'] == 'Barack Obama']
Now, I would like the most popular word. When I do:
adf=row[['word_count']]
I get another data frame because I see that:
type(adf)=<class 'pandas.core.frame.DataFrame'>
and if I do
adf.values
I get:
array([[ {u'operations': 1, u'represent': 1, u'office': 2, ..., u'began': 1}], dtype=object)
However, what is very confusing to me is that the size is 1
adf.size=1
Therefore, I do not know how to actually extract the keys and values. Things like adf.values[1] do not work.
Ultimately, what I need to do is sort the information in word_count so that the most frequent words appear first.
But I would like to understand how to access the information that is inside a dictionary, inside a data frame... I am lost about the types here. I am not new to programming, but I am relatively new to Python.
Any help would be very much appreciated.
If the name column is unique, then you can make that column the index of the DataFrame object: wiki.set_index("name", inplace=True). Then you can get the value with: wiki.at['Barack Obama', 'word_count'].
With your code:
row = wiki[wiki['name'] == 'Barack Obama']
adf = row[['word_count']]
The first line uses a boolean array to get the data; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
wiki is a DataFrame object, and row is also a DataFrame object with only one row, if the name column is unique.
The second line gets a list of columns from the row; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You get a DataFrame with only one row and one column.
And here is the documentation for .at[]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting
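To close the loop on the sorting part of the question, a minimal sketch assuming each word_count cell holds a plain dict (the counts here are made up):
import pandas as pd

wiki = pd.DataFrame({'name': ['Barack Obama'],
                     'word_count': [{'office': 2, 'began': 1, 'represent': 1}]})
wiki.set_index('name', inplace=True)

wc = wiki.at['Barack Obama', 'word_count']  # now a plain dict
top = sorted(wc.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('office', 2), ('began', 1), ('represent', 1)]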
