python, dictionary in a data frame, sorting

I have a python data frame called wiki, with the wikipedia information for some people.
Each row is a different person, and the columns are: 'name', 'text' and 'word_count'. The information in 'text' has been put in dictionary form (keys, values) to create the information in the column 'word_count'.
If I want to extract the row related to Barack Obama, then:
row = wiki[wiki['name'] == 'Barack Obama']
Now, I would like the most popular word. When I do:
adf=row[['word_count']]
I get another data frame because I see that:
type(adf)=<class 'pandas.core.frame.DataFrame'>
and if I do
adf.values
I get:
array([[{u'operations': 1, u'represent': 1, u'office': 2, ..., u'began': 1}]], dtype=object)
However, what is very confusing to me is that the size is 1
adf.size=1
Therefore, I do not know how to actually extract the keys and values. Things like adf.values[1] do not work
Ultimately, what I need to do is sort the information in word_count so that the most frequent words appear first.
But I would like to understand how to access the information that is inside a dictionary, inside a data frame... I am lost about the types here. I am not new to programming, but I am relatively new to Python.
Any help would be very very much appreciated

If the name column is unique, then you can change that column to the index of the DataFrame object: wiki.set_index("name", inplace=True). Then you can get the value with: wiki.at['Barack Obama', 'word_count'].
With your code:
row = wiki[wiki['name'] == 'Barack Obama']
adf = row[['word_count']]
The first line uses a boolean array to select the data; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
wiki is a DataFrame object, and row is also a DataFrame object with only one row, if the name column is unique.
The second line gets a list of columns from the row; here is the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You get a DataFrame with only one row and one column.
And here is the document of .at[]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting
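Putting the pieces together, here is a minimal, self-contained sketch of the set_index/.at approach, ending with the sort the question ultimately asks for. The sample dict below is made up to mirror the question's data:

```python
import pandas as pd

# Hypothetical frame mirroring the question's structure
wiki = pd.DataFrame({
    'name': ['Barack Obama'],
    'word_count': [{'operations': 1, 'represent': 1, 'office': 2, 'began': 1}],
})
wiki.set_index('name', inplace=True)

# .at returns the scalar cell value -- here, the dict itself, not a DataFrame
word_count = wiki.at['Barack Obama', 'word_count']

# Sort (word, count) pairs by count, most frequent first
top_words = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
print(top_words[0])  # ('office', 2)
```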

Related

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using the .at[] function
    # in the 'name' column and row i
    sel = df_selection_names.at[i, 'name']
    # this line finds the row 'sel' in df_everything
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
You can avoid the loop entirely with Series.isin: https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
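The isin approach can be sketched on hypothetical data matching the question's description (the names and suggestions below are made up):

```python
import pandas as pd

# Made-up frames matching the question's description
df_selection_names = pd.DataFrame({'name': ['fatty red meat, like prime rib', 'grilled']})
df_everything = pd.DataFrame({
    'name': ['grilled', 'steamed', 'fatty red meat, like prime rib'],
    'suggestion': ['charcoal', 'bamboo basket', 'horseradish'],
})

# Keep only the rows of df_everything whose name appears in the selection
matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]

# Print one line per (name, suggestion) pair
for _, r in matches.iterrows():
    print(f"{r['suggestion']} is suggested for {r['name']}")
```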

Split and create data from a column to many columns

I have a pandas data frame in which the values of one of its columns looks like that
print(VCF['INFO'].iloc[0])
Results (Sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns with the name END, SVTYPE and SVLEN and their info as values of that columns. Following the example, this would be
END SVTYPE SVLEN
224015456 DEL -223224913
The rest of the info contained in the INFO column I do not need so far.
The information contained in this column is huge, but as far as I can read there are no more something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
END SVTYPE SVLEN
0 224015456 DEL -223224913
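To make the answer reproducible, here is a sketch with a made-up INFO string following the key=value; pattern described in the question:

```python
import pandas as pd

# Made-up INFO field following the key=value; pattern from the question
VCF = pd.DataFrame({'INFO': ['END=224015456;SVTYPE=DEL;SVLEN=-223224913;CIPOS=0,0;']})

# Named groups in the regex become the column names of the result
extracted = VCF['INFO'].str.extract(
    r'END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);'
)
print(extracted)
```

Note that .str.extract returns strings; cast with .astype(int) if you need numeric END/SVLEN columns.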

Find which columns contain a certain value for each row in a dataframe

I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can send the values equal to 1 to a list:
df_34972[df_34972[34972] == 1].index.tolist()
If you are trying to do this for all stories, then you can do this, but it will be a slightly different technique. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you could .groupby('story_column') which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
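On a small made-up indicator matrix, the melt approach looks like this:

```python
import pandas as pd

# Made-up story-by-word indicator matrix (story ids as index)
df = pd.DataFrame(
    {'dragon': [1, 0], 'castle': [1, 1], 'robot': [0, 1]},
    index=[34972, 34973],
)

# Reshape to long form: one row per (story, word, 0/1) triple
long_df = df.reset_index().melt(id_vars='index')
present = long_df[long_df['value'] == 1]
words_per_story = present.groupby('index')['variable'].apply(list)
print(words_per_story[34972])  # ['dragon', 'castle']
```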

Selection in dataframe with array as column value

I have a dataframe filled with twitter data. The columns are:
row_id : Int
content : String
mentions : [String]
value : Int
So for every tweet I have it's row id in the dataframe, the content of the tweet, the mentions used in it (for example: '#foo') as an array of strings and a value that I calculated based on the content of the tweet.
An example of a row would be:
row_id : 12
content : 'Game of Thrones was awful'
mentions : ['#hbo', '#tv', '#dissapointment', '#whatever']
value: -0.71
So what I need is a way to do the following 3 things:
find all rows that contain the mention '#foo' in the mentions-field
find all rows that ONLY contain the mention '#foo' in the mentions-field
above two but checking for an array of strings instead of checking for only one handle
If anyone could help me with this, or even just point me in the right direction, that'd be great.
Let's call your DataFrame df.
For the first task you use:
result = df[(pd.DataFrame(df['mentions'].tolist()) == '#foo').any(axis=1)]
Here, pd.DataFrame(df['mentions'].tolist()) creates a new DataFrame where each column is one mention slot and each row a tweet.
Then == '#foo' generates a boolean DataFrame containing True where the mention is '#foo'.
Finally .any(axis=1) returns a boolean index whose elements are True if any element in the row is True.
I think with this help you can manage to solve the rest for yourself.
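For list-valued columns, plain Python checks inside .apply are often the simplest route. A sketch on made-up tweets covering all three tasks:

```python
import pandas as pd

# Made-up tweets with the schema from the question
df = pd.DataFrame({
    'content': ['tweet a', 'tweet b', 'tweet c'],
    'mentions': [['#foo', '#bar'], ['#foo'], ['#baz']],
    'value': [0.1, -0.2, 0.3],
})

# 1) rows whose mentions contain '#foo'
has_foo = df[df['mentions'].apply(lambda m: '#foo' in m)]

# 2) rows whose mentions are ONLY '#foo'
only_foo = df[df['mentions'].apply(lambda m: m == ['#foo'])]

# 3) rows containing every handle in a given list
handles = ['#foo', '#bar']
has_all = df[df['mentions'].apply(lambda m: set(handles) <= set(m))]
```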

Extract value from single row of pandas DataFrame

I have a dataset in a relational database format (linked by ID's over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me as it seems ugly, but I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.
If you want just the value and not a df/series then call values and index the first element [0] so just:
price = purchase_group['Column_name'].values[0]
will work.
If purchase_group has single row then doing purchase_group = purchase_group.squeeze() would make it into a series so you could simply call purchase_group['Column_name'] to get your values
Late to the party here, but purchase_group['Column_name'].item() is now available and is cleaner than some of the other solutions.
This method is intuitive; for example to get the first row (list from a list of lists) of values from the dataframe:
np.array(df)[0]
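Side by side on a one-row toy frame (the column name and values below are made up), the three idioms from the answers agree:

```python
import pandas as pd

# Made-up one-row group, as in the question
purchase_group = pd.DataFrame({'Column_name': [19.99], 'user_id': [7]})

price1 = purchase_group['Column_name'].values[0]  # via the underlying array
price2 = purchase_group.squeeze()['Column_name']  # 1-row frame squeezed to a Series
price3 = purchase_group['Column_name'].item()     # raises unless exactly one element
print(price1, price2, price3)
```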
