I have a fairly small DataFrame. I want to add a column that, for each row, looks up the value in the column whose name is stored in that row. So in the example below, the value should come from the column named 'PA1.13'
example = {'Honda Civic': [1],
           'Toyota': [0],
           'valuetolookup': ['Honda Civic'],
           'Result should be': [1]
           }
As you can see the column has two levels. I cannot seem to find how to make a second column level from scratch, but here I hope that I can work it out if someone wants to use my example code to solve it :-)
You can use a simple apply() to extract data like you want:
import pandas as pd
example = {'Honda Civic': [1, 3],
           'Toyota': [0, 2],
           'valuetolookup': ['Honda Civic', 'Toyota'],
           'Result should be': [1, 2]
           }
df = pd.DataFrame(example)
# In apply, each row's "valuetolookup" value is used as the name of the column to read from
df["Result"] = df.apply(lambda x: x[x["valuetolookup"]], axis=1)
I added another row to show you that you can use different columns to lookup :)
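For larger frames, a row-wise apply() can become slow. A vectorized variant is sketched below, following the factorize/reindex pattern that the pandas documentation describes as a replacement for the removed DataFrame.lookup. It reuses the two-row example from above; treat it as one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Honda Civic': [1, 3],
    'Toyota': [0, 2],
    'valuetolookup': ['Honda Civic', 'Toyota'],
})

# Encode each row's target column name as an integer position
idx, cols = pd.factorize(df['valuetolookup'])

# Reorder the columns to match those positions, then pick one cell per row
values = df.reindex(cols, axis=1).to_numpy()
df['Result'] = values[np.arange(len(df)), idx]

print(df)
```

This assumes every name in valuetolookup is an existing column; a missing name would yield NaN from reindex instead of raising.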
Related
I've been trying to find out the top-3 highest frequency restaurant names under each type of restaurant
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
         .apply(lambda x: x.sort_values(by="url", ascending=False).head(3))
         ['url'].reset_index().rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after already grouping by it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first formulated column level_0 signify? I tried the code with as_index=True and it created an index and column pertaining to rest_type so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is present in the index after the first groupby, and groupby recognizes index level names as keys.
level_0 comes from the reset_index command because your index has an unnamed level.
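Both points can be checked on a small frame. The sketch below uses made-up data (only the column names rest_type, name and url come from the question) and builds the unnamed index level by hand to show where level_0 comes from.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question
df = pd.DataFrame({
    'rest_type': ['Cafe', 'Cafe', 'Cafe', 'Bar'],
    'name': ['a', 'a', 'b', 'c'],
    'url': ['u1', 'u2', 'u3', 'u4'],
})

counts = df.groupby(['rest_type', 'name']).agg('count')
print(counts.index.names)  # rest_type and name are now index levels

# groupby accepts an index level name as a key, so this does not fail
per_type = counts.groupby('rest_type')['url'].sum()

# An unnamed index level gets a default name on reset_index()
idx = pd.MultiIndex.from_tuples(
    [(0, 'Bar', 'c'), (1, 'Cafe', 'a')],
    names=[None, 'rest_type', 'name'],
)
flat = pd.Series([1, 2], index=idx, name='url').reset_index()
print(list(flat.columns))
```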
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    # count occurrences of each name and keep the n most frequent
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
.apply(lambda x: x['name'].value_counts().nlargest(3))
.reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
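The names that reset_index() produces after value_counts() depend on the pandas version (pandas 2.0 changed how the value_counts() result is named), so the rename above may need adjusting. A sketch that sidesteps this by naming the count column explicitly is shown below; the data is hypothetical and the column name count is my own choice.

```python
import pandas as pd

# Hypothetical data with the same columns as in the question
df = pd.DataFrame({
    'rest_type': ['Cafe'] * 6 + ['Bar'] * 3,
    'name': ['a', 'a', 'a', 'b', 'b', 'c', 'x', 'x', 'y'],
    'url': range(9),
})

top = (
    df.groupby(['rest_type', 'name'])
      .size()                                   # rows per (rest_type, name) pair
      .reset_index(name='count')                # explicit name for the count column
      .sort_values(['rest_type', 'count'], ascending=[True, False])
      .groupby('rest_type')
      .head(2)                                  # top 2 per type here; use 3 for top-3
      .reset_index(drop=True)
)
print(top)
```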
I have a similar case where the above query seems to work only partially: in my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
I am getting this result:
The cooccurrence is always 1.
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
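A small self-contained check of the same idea, written with Series.mask instead of .loc assignment. The IDs below are made up for illustration; only the two column names come from the question.

```python
import pandas as pd

# Made-up IDs; only the column names are taken from the question
df = pd.DataFrame({
    'FBgn': ['FBtr0001', 'FBgn0002', 'FBtr0003'],
    '## FlyBase_FBgn': ['FBgn1111', 'FBgn0002', 'FBgn3333'],
})

# True where the "FBgn" column actually holds a transcript (FBtr) ID
is_transcript = df['FBgn'].str.startswith('FBtr')

# Replace only those rows with the value from the adjacent column
df['FBgn'] = df['FBgn'].mask(is_transcript, df['## FlyBase_FBgn'])

print(df)
```

If the column can contain missing values, the boolean mask would need handling for them (for example str.startswith('FBtr', na=False)).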
Welcome to Stack Overflow. Next time, please provide more information, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# Ignore dict1; it only recreates your DataFrame
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)
print(df)

# Replace each FBtr value in "FBgn" with the value from the adjacent column
def replace_values(df):
    for i in df.index:
        if 'tr' in df.loc[i, 'FBgn']:
            # .loc avoids chained assignment, which may not write back to df
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)
print(df)
I have a pandas data frame with only two column names (a single row, which can also be considered the header). I want to make a dictionary out of it, with the first column being the key and the second column being the value. I already tried the to_dict() method, but it does not work because the data frame is empty.
Example
df=|Land |Norway| to {'Land': Norway}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, and the list to a dictionary.
import pandas as pd

def list_to_dict(a):
    # zipping one iterator with itself pairs consecutive items
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
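If a plain dictionary is wanted rather than a one-row DataFrame, the same slicing can feed dict() directly. A minimal sketch, where the extra column names City and Oslo are hypothetical and only there to show a second pair:

```python
import pandas as pd

# 'City' and 'Oslo' are hypothetical extra columns for illustration
df = pd.DataFrame(columns=['Land', 'Norway', 'City', 'Oslo'])

cols = df.columns.to_list()

# Pair even-positioned names (keys) with odd-positioned names (values)
pairs = dict(zip(cols[::2], cols[1::2]))
print(pairs)
```

With an odd number of columns, zip silently drops the last unpaired name.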
This is a basic question so apologies in advance.
I am using Pandas and I am grouping data with the following line:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()['keyword']
This is referencing the following:
The data frame: [page_serp_df]
Grouping by the column: meta_keywords_1_length
Counting with the filter: keyword column
What I don't understand is why does the filtering condition have to be ['keyword'] i.e. a string in quotes?
For example, this doesn't work, which is very counterintuitive to me:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()[page_serp_df.keyword]
Thanks in advance!
I think there is a misunderstanding on what the .count() method returns.
Try to follow this example:
Create a sample data frame
df = pd.DataFrame({
'A':[0,1,0,1, 1],
'B':[100,200,300, 400, 500],
'C': [1,2,3,4,5]
})
This is what the count() method will return after groupby
# similarly to your example I am grouping by A and counting
df.groupby([df.A]).count()
As you can see, count() itself returns a dataframe: for each distinct value of the grouping column, it holds the number of non-null values in every other column.
After that, you can select a specific column from the result of count() like this:
df.groupby([df.A]).count()['C']
But the second case in your example, which in my example would correspond to df.groupby([df.A]).count()[df.C], will throw an error!
In fact, you would be indexing a dataframe (in this case df.groupby([df.A]).count()) with a pandas Series, whereas selecting a column requires its label, i.e. a string from df.columns.
You can check for yourself that df.C and 'C' are two very different types.
print(type(df.C))
print(type('C'))
# <class 'pandas.core.series.Series'>
# <class 'str'>
If for some reason your code still works with the equivalent of df.C, it is probably a coincidence, for example the only value in df.C being a string that matches a column name, or something similarly unintentional.
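The whole comparison can be run in one go. The sketch below reuses the sample frame from above, passes the column name 'A' to groupby for brevity, and also shows the variant that selects the column before counting.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 1, 1],
    'B': [100, 200, 300, 400, 500],
    'C': [1, 2, 3, 4, 5],
})

counts = df.groupby('A').count()        # a DataFrame of per-group counts
by_label = counts['C']                  # select by column label (a string)
direct = df.groupby('A')['C'].count()   # select the column first, then count

try:
    counts[df.C]                        # a Series of values is not a column label
    failed = False
except KeyError:
    failed = True

print(by_label)
print(failed)
```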
I created a pandas dataframe from a dictionary like this:
dictionary = {'cat': ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'], 'Dog': ['c1', 'c2', 'c3'], 'Bird': ['d1', 'd2', 'd3', 'd4', 'd5']}
df = pd.DataFrame(dictionary.items(), columns=['ID_1','ID_match'])
But I get a table looking like this:
And I would like to be this way:
So far I have tried this:
df_2_1 = df.replace('', np.nan).set_index('ID_1').stack().reset_index(name='ID_match').drop('level_1', axis=1)
But I still get the second column as a list...
Can someone point me in the right direction?
Solution:
I just needed to expand the second column:
df.explode('ID_match')
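For reference, a complete runnable version with a shortened dictionary (the values are placeholder strings). explode() returns a new frame and repeats the original index, so the sketch assigns the result and resets the index.

```python
import pandas as pd

# Shortened placeholder data in the same shape as the question's dictionary
dictionary = {'cat': ['B1', 'B2', 'B3'], 'Dog': ['c1', 'c2'], 'Bird': ['d1']}

# One row per key, with the whole list stored in ID_match
df = pd.DataFrame(list(dictionary.items()), columns=['ID_1', 'ID_match'])

# One row per list element; ID_1 is repeated for each element
long_df = df.explode('ID_match').reset_index(drop=True)
print(long_df)
```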
This solution should work. The first .iloc is taking every other starting with the first column, and the second is taking every other starting with the second column.
df1 = df.iloc[:,::2].melt()
df1 = df1['variable']
df2 = df.iloc[:,1::2].melt()
df2 = df2['value']
df3 = pd.DataFrame({'col1':df1, 'col2':df2})