Basically, I have an Excel file (saved as a CSV) that contains thousands of rows of data, and I'm using pandas to read in the file.
import pandas as pd
agg = pd.read_csv('Station.csv', sep=',')
Then I grouped the data according to these categories:
month_station = agg.groupby(['month','StationName'])
The groupby will not be used for computing the mean, median, etc., but just for grouping the data by month and station name; that's what the question asks for.
Now I want to output month_station to an Excel file, so first I need to get the groupby result into a dataframe.
I've seen examples:
pd.DataFrame(month_station.size().reset_index(name="Group_Count"))
But the thing is, I don't need the size/count of my data, just the grouping by month and station name, which doesn't involve a count or sort. I tried removing the size() and it gives me an error.
I just want the content of month_station ported into a dataframe so I can proceed and output it as a CSV file, but it seems complicated.
The nature of groupby is that you can derive an aggregate calculation, such as mean, count, or sum. If you are merely trying to see one of each pair of month and station name, try this:
month_station = agg.groupby(['month','StationName'], as_index=False).count()
month_station = month_station[['month','StationName']]
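Equivalently, since no aggregate is actually needed, drop_duplicates gets the unique month/station pairs directly and can be written straight to CSV. A minimal sketch, assuming the same file and column names as above (the output filename is just an example):
import pandas as pd
agg = pd.read_csv('Station.csv')
# keep one row per unique (month, StationName) pair
month_station = agg[['month', 'StationName']].drop_duplicates()
month_station.to_csv('month_station.csv', index=False)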
I have two CSVs which have five columns: id, product name, normal price, discount price, card price, and discount. The two CSVs have different prices from different timelines, so I want to plot a graph to see the behavior of the prices for each product. Both have products that are repeated, and I would like those repeated products to appear in the graph.
I tried converting to a dataframe, but with no success.
This is how the CSV file looks:
It would be helpful to see your second .csv file too.
From this answer.
Assuming both your files have the same column names, use the pandas.concat() method to make one dataframe from a list of dataframes.
Then, simply plot a graph of your new dataframe as usual.
import pandas as pd
listOfDFs = []
df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')
listOfDFs.append(df1)
listOfDFs.append(df2)
# stack the dataframes vertically and renumber the index
dfBoth = pd.concat(listOfDFs, axis=0, ignore_index=True)
dfBoth.plot()
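If you want each repeated product to show up as its own line rather than plotting every numeric column, one option is to group by product and plot each group. A rough sketch, assuming columns named 'product name' and 'normal price' (names taken from the question's description, so adjust to your actual headers):
import matplotlib.pyplot as plt
for product, grp in dfBoth.groupby('product name'):
    # one line per product, using row order as the x-axis
    plt.plot(grp.index, grp['normal price'], label=product)
plt.legend()
plt.show()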
I am trying to read a big CSV and then split it into smaller CSV files, based on unique values in the column team.
At first I created a new dataframe for each team; new txt files were generated, one for each unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However, I want to start from a single dataframe, read all unique 'teams', and create a .txt file for each team, with headers.
Is it possible?
When you iterate over a pandas.DataFrame.groupby object without applying an aggregation, you get each group key together with the sub-dataframe of rows belonging to that group.
The following code will create a file for the data associated to each unique value in the column used to groupby.
Use f-strings to create a unique filename for each group.
import pandas as pd
# create the dataframe
df = pd.read_csv('combined.csv')
# groupby the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
# save the dataframe for each group to a csv
dataframe.to_csv(f'{group}.txt', sep='\t', index=False)
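For example, if the team column contains the values 'RED' and 'BLUE' ('BLUE' is just an illustrative value here; 'RED' comes from the code above), this loop writes RED.txt and BLUE.txt, each tab-separated with a header row.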
I am a complete noob at this Python and Jupyter Notebook stuff. I am taking an Intro to Python course and have been assigned a task: extract information from a .csv file. The following is a snapshot of my .csv file, titled "feeds1.csv":
https://i.imgur.com/BlknyC3.png
I can import the .csv into Jupyter Notebook, and I have tried the groupby function to group it, but it won't work because the column also contains the time.
import pandas as pd
df = pd.read_csv("feeds1.csv")
I need it to output as follows:
https://i.imgur.com/BDfnZrZ.png
The ultimate goal would be to create a csv file with this accumulated data and use it to plot a chart.
If you do not need the time of day but just the date, you can simply use this:
df.created_at = df.created_at.str.split(' ').str[0]
dfout = df.groupby(['created_at']).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[['created_at', 'entry_id']]
finaldf.columns = ['Date', 'field2']
finaldf.to_csv('outputfile.csv', index=False)
The first line splits the created_at column at the space between the date and the time; the .str[0] keeps only the first part of the split (the date).
The second line groups the rows by date and gives you the count.
When writing to csv, if you do not want the index to show (as in your pic), use index=False. If you want the index, just leave that argument out.
First, you need to parse your date correctly:
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
# You can choose to drop the earlier columns
# Now just group by the date and apply the aggregation/function you want
df = df.groupby(["date_time"])["field2"].sum().reset_index()  # for example
df.to_csv("abc.csv", index=False)
I've got a dataframe that contains several columns, including a user ID (id) and a timestamp (startTime). I want to check how many different days my data (df rows) span, per user.
I'm currently doing that by splitting up the df by 'id', and then calculating the following in a loop for each of the subset dfs:
days = len(df.startTime.dt.date.unique())
How do I do this more efficiently, without splitting up the data frame? I'm working with rather large data frames, and I fear this will take way too much time. I've looked at the groupby function, but I didn't get far. I tried something like:
result = df.groupby('id').agg({'days': lambda x: x.startTime.dt.date.unique()})
... but that clearly didn't work.
You can use drop_duplicates before value_counts:
df['New Date'] = df['startTime'].dt.date
result = df.drop_duplicates(['id','New Date']).id.value_counts()
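An alternative that reads a bit more directly is nunique on the grouped dates, which counts distinct days per user in one pass; a sketch using the same columns:
df['New Date'] = df['startTime'].dt.date
days_per_user = df.groupby('id')['New Date'].nunique()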
I am trying to combine two separate data series of one-minute data to create a ratio, then create Open High Low Close (OHLC) files for the ratio for the entire day. I bring in the two time series and create associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files on that variable using the pd.merge approach. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the index is being sorted as text, so the order goes by the month portion of the string rather than by the whole date, and the result isn't chronological. My goal is to get it sorted by date so it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and my output file is out of order. More generally, I am not sure how to reference/manipulate the index in a pandas dataframe, so any associated material would be useful.
Thank you
Years later...
This fixes the problem.
Assuming df is your dataframe:
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the index to datetime
df = df.sort_index()  # sort by the converted index
This should get the sorting back into chronological order.
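One caveat: with strings like 12/2/2013, pd.to_datetime has to guess whether the day or the month comes first. If you know the format, passing it explicitly avoids any ambiguity; here assuming month/day/year, as the sample index above suggests:
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
df = df.sort_index()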