I have a dataframe structured as follows:
"Location","filePath","startLine","endLine","startColumn","endColumn","codeElementType","description", "codeElement","repository","sha1","url","type","description.1"
An example of the dataframe is the following:
I need to merge the entries that have the same sha1.
An example of the output should be the following:
Supposing that this is the input:
The expected output should be the following:
This is because, in this case, the first two lines have the same sha1.
I tried the following snippet:
agg_functions=["Location","filePath","startLine","endLine", "startColumn","endColumn","codeElementType","description",
"codeElement","repository","sha1","url","type","description.1"]
df_new = df.groupby(df['sha1']).aggregate(agg_functions)
print(df_new)
However, the following exception is always thrown:
raise AttributeError(
AttributeError: 'SeriesGroupBy' object has no attribute 'Location'
How can I fix it?
agg_functions should be references to functions, not column names.
For example:
agg_functions = [np.sum, "mean"]
See DataFrameGroupBy.aggregate.
I can't help with an exact fix. I must confess the text in the images you posted is too small for me to understand what your final result needs to be.
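That said, if the goal is simply to collapse rows that share a sha1 into one row, a minimal sketch could look like the following. It assumes the merge rule is "join differing values into one comma-separated string per column" (an assumption, since the exact rule isn't visible from the images), and it only shows a few of the columns from your header:

import pandas as pd

# Toy data with a few of the columns from the question (values are made up).
df = pd.DataFrame({
    "Location": ["A.java:10", "A.java:20", "B.java:5"],
    "description": ["first change", "second change", "other change"],
    "sha1": ["abc123", "abc123", "def456"],
})

# Pass a function to aggregate, not column names: here every column in a
# group is collapsed into one comma-separated string of its unique values.
df_new = (
    df.groupby("sha1", as_index=False)
      .agg(lambda col: ", ".join(col.astype(str).unique()))
)
print(df_new)

If only one representative row per sha1 is needed instead, df.groupby("sha1", as_index=False).first() would do.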
I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found that the str functions for pandas Series are
usually used when selecting specific rows, like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and
we'll consistently get the last 5 characters instead!
So, I have a list of DataFrames called 'dataframes'. In the first dataframe (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It seemed to me it shouldn't work because it's a Series, but it should give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj'] = dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0 = dataframes[0]
df0['cnj'] = df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object has no string or date manipulation methods. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new series. That's why .str[-5:] is needed to retrieve the last 5 characters. That again results in a new series. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
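As a small, self-contained illustration of the accessor pattern described above (the data is made up):

import pandas as pd

s = pd.Series(["  alice  ", "bob   "])
d = pd.Series(pd.to_datetime(["2021-01-01", "2021-06-15"]))

print(s.str.strip())  # string methods live behind the .str accessor
print(d.dt.year)      # datetime methods live behind the .dt accessor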
You could just apply a transformation function to the column values like this:
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
What you need is a string without trailing spaces, in a Series or a DataFrame; at least that's my understanding of your question. Use str.rstrip(), which will work on both Series and DataFrame objects.
Note: strip() is normally only for string data types, so the error you are getting is expected.
Refer to the link, and try the str.rstrip() provided by pandas.
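A minimal sketch of that, using a value like the one in the question (the second value is made up):

import pandas as pd

df0 = pd.DataFrame({"cnj": ["0100758-73.2019.5.01.0064 ", "0100759-11.2019.5.01.0064"]})

# .str.rstrip() removes only trailing whitespace from each string.
df0["cnj"] = df0["cnj"].str.rstrip()
print(df0["cnj"].tolist())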
For str.strip(), you can refer to this link; it works for me.
In your case, assuming the dataframe column is s, you can use the code below:
df[s].str.strip()
I am working with a pandas df with info about Netflix shows. One of the columns, netflix_df['country'], is a string with all the countries involved in the production. I created an additional column, netflix_df['list_of_countries'], which is intended to be a list of the individual countries (because there are co-productions involving more than one country), using the following code:
netflix_df['list_of_countries']=netflix_df['country'].str.split(',')
Afterwards, I attempted to create a new column, netflix_df['number_of_countries'], containing the number of countries in each list of netflix_df['list_of_countries'], by doing the following:
netflix_df['number_of_countries'] = [len(c) for c in netflix_df['list_of_countries']]
Nevertheless, I got the following error: TypeError: object of type 'float' has no len()
This doesn't make sense to me, since the column is filled with lists and not floats. What is wrong in my code? I would appreciate some help with this. Thank you very much.
You can get the length of each list like below:
netflix_df['number_of_countries'] = netflix_df['list_of_countries'].apply(len)
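If that still raises the same TypeError, the likely cause (an assumption, since the data isn't shown) is that country has missing values: str.split leaves NaN, which is a float, wherever the original value was NaN, and len(NaN) fails. A sketch that tolerates missing values by counting with .str.len() and treating missing as zero:

import pandas as pd
import numpy as np

netflix_df = pd.DataFrame({"country": ["United States, United Kingdom", "India", np.nan]})
netflix_df["list_of_countries"] = netflix_df["country"].str.split(",")

# .str.len() returns NaN for missing rows instead of raising, and
# fillna(0) treats "no country listed" as zero countries.
netflix_df["number_of_countries"] = (
    netflix_df["list_of_countries"].str.len().fillna(0).astype(int)
)
print(netflix_df["number_of_countries"].tolist())  # [2, 1, 0]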
I have a dataframe that looks like this:
It looks a little weird because there is a blank space under 'product_id', but it is a dataframe. I tested it with this check:
if isinstance(prod_names, pd.DataFrame):
print('DF')
The DF comes from a count function.
prod_names = pd.DataFrame(df.groupby('product_name')['product_id'].count().sort_values(ascending=False).head(20))
Now, I am trying to plot the results, like this.
pd.value_counts(prod_names['product_name']).plot.bar()
When I run that line of code, I get this error:
KeyError: 'product_name'
When I list the column names in the 'prod_names' dataframe:
list(prod_names)
I see only: ['product_id']
For some reason, the 'product_name' field is missing. It may have something to do with the space under the 'product_id', but I'm not sure. Thoughts?
Most probably product_name here has become the index; I don't see any other index, and in this case you cannot access it through a column name. You can either reset the index (adding a new numerical index) via prod_names.reset_index(), or just call prod_names.index to see the product name info. In the first case you can keep your function; in the second you can modify it to something like pd.value_counts(prod_names.index).plot.bar()
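A minimal sketch, assuming the goal is to plot the count per product name (the data is made up, and plotting needs matplotlib):

import pandas as pd

df = pd.DataFrame({
    "product_name": ["apples", "apples", "bananas"],
    "product_id": [101, 102, 103],
})

prod_names = pd.DataFrame(
    df.groupby("product_name")["product_id"].count()
      .sort_values(ascending=False).head(20)
)

# Move product_name from the index back into a regular column, then plot.
prod_names_reset = prod_names.reset_index()
prod_names_reset.plot.bar(x="product_name", y="product_id")

# Or, keeping product_name as the index, plot the counts directly.
prod_names["product_id"].plot.bar()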
I have a data frame "df" with a column called "column1". By running the below code:
df.column1.value_counts()
I get output that contains the values in column1 and their frequencies. I want this result in Excel. When I try to do this by running the code below:
df.column1.value_counts().to_excel("result.xlsx",index=None)
I get the below error:
AttributeError: 'Series' object has no attribute 'to_excel'
How can I accomplish the above task?
You are using index=None, but you need the index; it holds the names of the values.
pd.DataFrame(df.column1.value_counts()).to_excel("result.xlsx")
If you go through the documentation, Series has no to_excel method; it applies only to DataFrame.
So you can either save it to another frame and create an Excel file:
a = pd.DataFrame(df.column1.value_counts())
a.to_excel("result.xlsx")
Or look at Merlin's comment; I think it is the best way:
pd.DataFrame(df.column1.value_counts()).to_excel("result.xlsx")
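A small end-to-end sketch of that approach (the column name follows the question; the data is made up, and writing .xlsx requires an Excel engine such as openpyxl):

import pandas as pd

df = pd.DataFrame({"column1": ["a", "b", "a", "c", "a"]})

# Wrap the Series in a DataFrame so to_excel is available; keep the index,
# since it holds the value labels.
counts = pd.DataFrame(df.column1.value_counts())
counts.to_excel("result.xlsx")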
I am trying to delete the first two rows from my dataframe df and am using the answer suggested in this post. However, I get the error AttributeError: Cannot access callable attribute 'ix' of 'DataFrameGroupBy' objects, try using the 'apply' method, and I don't know how to do this with the apply method. I've shown the relevant lines of code below:
df = df.groupby('months_to_maturity')
df = df.ix[2:]
Edit: Sorry, when I say I want to delete the first two rows, I mean the first two rows associated with each months_to_maturity value.
Thank You
That is what tail(-2) will do. However, groupby.tail does not take a negative value, so it needs a tweak:
df.groupby('months_to_maturity').apply(lambda x: x.tail(-2))
This will give you the desired dataframe, but its index is now a MultiIndex.
If you just want to drop the rows from df, use drop like this:
df.drop(df.groupby('months_to_maturity').head(2).index)
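A quick comparison of the two approaches on a toy frame (the data is made up):

import pandas as pd

df = pd.DataFrame({
    "months_to_maturity": [1, 1, 1, 2, 2, 2, 2],
    "price": [10, 11, 12, 20, 21, 22, 23],
})

# Approach 1: keep everything except the first two rows of each group;
# note the resulting MultiIndex of (months_to_maturity, original index).
kept = df.groupby("months_to_maturity").apply(lambda x: x.tail(-2))

# Approach 2: drop the first two rows of each group, keeping the original index.
dropped = df.drop(df.groupby("months_to_maturity").head(2).index)

print(kept)
print(dropped)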