Group by and sum the nth columns of a pandas DataFrame - python

I am working on some Python data analytics.
First, this is the raw data:
I want to get a result like:
My code is:
df_sellout.groupby("Brand")[:,0:4].sum()
But this doesn't work.
I want to use [:,0:4] because I have another, much larger dataset whose column names I can't all write out.
Can anyone help me, please?

Try this:
df_sellout.groupby("Brand")[df_sellout.columns[2:]].sum()

Put the iloc indexing in front of the groupby (note that the positional selection must still include the "Brand" column):
df_sellout.iloc[:,0:4].groupby("Brand").sum()
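To make this concrete, here is a minimal sketch with made-up data (the original df_sellout is not shown, so the "Brand", "Jan", and "Feb" columns are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for df_sellout
df_sellout = pd.DataFrame({
    "Brand": ["A", "B", "A", "B"],
    "Jan": [10, 20, 30, 40],
    "Feb": [1, 2, 3, 4],
})

# Select the grouping key plus the first few columns by position,
# then group and sum. "Brand" must be inside the iloc selection,
# otherwise groupby("Brand") cannot find the key.
result = df_sellout.iloc[:, 0:3].groupby("Brand").sum()
print(result)
```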

Related

Delete index in a DataFrame Pandas

I'm using Python and I have a DataFrame like the following:
I need to delete all the index (column/row) labels so that I keep only the names of the columns. How can I do this, please? By the way, I'm running the code on Google Colab.
Thanks!
I think you want the first row as headers?
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
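A small runnable sketch of that one-liner, with made-up data standing in for the frame from the question:

```python
import pandas as pd

# Hypothetical frame where the real headers landed in row 0
df = pd.DataFrame([["name", "age"], ["Ana", 30], ["Bob", 25]])

# Promote row 0 to the column names, then drop that row
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(df)
```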

List index out of range when specifying column in numpy

I have been tasked with extracting data from a specific column of a CSV file using numpy and loadtxt. The data is in column D of the attached image. By my logic I should use the numpy parameter usecols=3 to obtain only the 4th column, which is the one I want. But my output keeps telling me that the index is out of range, when there is clearly a column there. I have done some prior searching, and the general consensus seems to be that one of the rows doesn't have any data in that column. But I have checked, and all the rows have data in the column. Here is the code I'm using (highlighted in green). Can anyone tell me why this is happening?
from numpy import loadtxt

data = open("suttonboningtondata_moodle.csv", "r")
min_temp = loadtxt(data, usecols=(3), skiprows=5, dtype=str, delimiter=" ")
print(min_temp)
I suggest you use another library to extract your data. The pandas library works well in this regard.
Here is a documentation link to guide you.
pandas docs
I used a comma instead of whitespace for the delimiter value and it worked. (In hindsight that makes sense: a CSV is comma-separated, so with delimiter=" " each line was parsed as a single field and column index 3 didn't exist.)

Converting 0-1 values in a dataset to the name of the column if the value of the cell is 1

I have a CSV dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace the 1 values with the name of their column. There are more than 500 thousand rows and 200 columns, and because the table is exported from another annotation tool that I update often, I want a way in Python to do it automatically.
This is not the real table, but a sample I was using while trying to write the code. I tried a few approaches, but without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the form (abonojnë, token_pos_verb). If you know any way to do this in Excel without the help of Python, it would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal that repeated words appear in every context, not only at their first occurrence.
The other alternative is this:
Use a second sheet in the Excel file: write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it across all the other cells. This keeps all occurrences of the words. It's quick and works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
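For completeness, a pandas variant that also keeps duplicates is a per-row join instead of a groupby. A minimal sketch with made-up data (two rows share a word to show duplicates survive):

```python
import pandas as pd

# Small stand-in for the annotated corpus
df = pd.DataFrame(
    {"token": [1, 1, 1], "verb": [1, 0, 1], "noun": [0, 1, 0]},
    index=pd.Index(["abonojnë", "çokasin", "abonojnë"], name="Text"),
)

# Join the names of the columns that hold 1, row by row; because this is a
# per-row apply rather than a groupby over the index, repeated words each
# keep their own result.
labels = df.apply(
    lambda row: "_".join(col for col in df.columns if row[col] == 1), axis=1
)
print(labels.tolist())
```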

How should I transpose a date column using pandas?

I have a dataframe like this:
I wish to transpose it to:
I understand this might be a basic question; if someone could direct me to the correct references, I can try to figure out how to do it in pandas.
try with melt() and set_index():
out=(df.melt(id_vars=['Market','Product'],var_name='Date',value_name='Value')
.set_index('Date'))
If needed, use:
out.index.name=None
Now if you print out, you will get your desired output.
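A runnable sketch of the melt/set_index recipe, with hypothetical data (the question's actual columns are not shown, so "Market", "Product", and the date columns are assumptions):

```python
import pandas as pd

# Hypothetical wide frame with one column per date
df = pd.DataFrame({
    "Market": ["US", "UK"],
    "Product": ["X", "Y"],
    "2021-01": [100, 200],
    "2021-02": [110, 210],
})

# Unpivot the date columns into rows, then move the dates into the index
out = (df.melt(id_vars=["Market", "Product"], var_name="Date", value_name="Value")
         .set_index("Date"))
print(out)
```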

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe holds a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to a DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, when I think it was a dictionary before I converted it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the json.loads(x) key/index to extract the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
I think it would be great if you could do something like this.
This creates a DataFrame from your JSON column by applying pd.Series (note this assumes the column holds parsed dicts rather than raw JSON strings):
data_frame_new = df['details'].apply(pd.Series)
Then reassign your DataFrame by concatenating data_frame_new with the existing frame:
df = pd.concat([df, data_frame_new], axis=1)
print(df)
This approach worked for me on a recent project: your affectedDealId becomes a column of its own, with the data populated.
I hope it is of help to you.
Thanks
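Pulling the pieces together, here is a self-contained sketch of the first answer's approach. The nested layout (actions -> affectedDealId) is an assumption based on the question's description:

```python
import json
import pandas as pd

# Hypothetical frame whose "details" column holds JSON strings with the
# nested structure described in the question
df = pd.DataFrame({
    "details": [
        json.dumps({"actions": {"affectedDealId": 101}}),
        json.dumps({"actions": {"affectedDealId": 202}}),
    ]
})

# Parse each cell and walk down to the nested key; apply avoids an explicit loop
df["affectedDealId"] = df["details"].apply(
    lambda x: json.loads(x)["actions"]["affectedDealId"]
)
print(df["affectedDealId"].tolist())
```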
