I have a pandas DataFrame df with two variables in it: a string variable str1 and a floating point numerical variable fp1. I group that DataFrame like so:
dfg = df.groupby(pandas.qcut(df['fp1'],4,labels=['g1','g2','g3','g4']))
I want to write out the grouped results to a csv file. When I try:
dfg.to_csv('dfg.csv')
the csv file contains observations only for group g4. How can I get the to_csv method to write out the whole grouped DataFrame dfg?
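One possible approach, sketched under the assumption that the goal is a single CSV in which every row carries its quartile label: store the qcut labels in an ordinary column and write the plain DataFrame instead of the groupby object. The column name `group` and the sample data are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample data standing in for the real df
df = pd.DataFrame({
    'str1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'fp1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Label each row with its quartile; 'group' is a hypothetical column name
df['group'] = pd.qcut(df['fp1'], 4, labels=['g1', 'g2', 'g3', 'g4'])

# Sort so rows of the same group sit together, then write one file
out = df.sort_values(['group', 'fp1'])
buffer = io.StringIO()  # pass a path such as 'dfg.csv' to write to disk instead
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer only keeps the sketch self-contained; the same to_csv call accepts a filename.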
Related
I am trying to read a big CSV. Then split big CSV into smaller CSV files, based on unique values in the column team.
At first I created a new dataframe for each team by hand, which produced one file per unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However I want to start from a single dataframe, read all unique 'teams', and create a .txt file for each team, with headers.
Is it possible?
pandas.DataFrame.groupby, when used without an aggregation, returns the dataframe components associated with each group in the groupby column.
The following code creates one file for the data associated with each unique value in the column used for the groupby.
Use f-strings to create a unique filename for each group.
import pandas as pd
# create the dataframe
df = pd.read_csv('combined.csv')
# groupby the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
    # save the dataframe for each group to a tab-separated text file
    dataframe.to_csv(f'{group}.txt', sep='\t', index=False)
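To see what iterating over a groupby object yields, here is a minimal sketch on made-up data (the `score` column and the `pieces` dictionary are hypothetical names). Each step gives the group key together with a DataFrame holding that group's rows, with all columns kept.

```python
import pandas as pd

# Made-up data standing in for combined.csv
df = pd.DataFrame({
    'team': ['RED', 'BLUE', 'RED', 'GREEN'],
    'score': [1, 2, 3, 4],
})

# Each iteration gives the group key and a DataFrame of that group's rows
pieces = {group: dataframe for group, dataframe in df.groupby('team')}

print(sorted(pieces))   # the group keys
print(pieces['RED'])    # rows where team == 'RED', all columns kept
```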
I have the following dataset and am reading it from a csv file.
x =[1,2,3,4,5]
With pandas I can access the array like this:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
Both seem to produce the same result. The former makes sense to me, but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list:" and then giving it a list with just one column in it. Pandas filters your df and returns all the columns in the list (in this case, a data frame with only 1 column).
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
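The difference is easy to check directly. A small sketch, using a made-up frame in place of train.csv (the `y` column is only there for illustration):

```python
import pandas as pd

# Made-up data standing in for train.csv
df_train = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [10, 20, 30, 40, 50]})

s = df_train['x']     # single label -> Series (1-dimensional)
d = df_train[['x']]   # list of labels -> DataFrame (2-dimensional)

print(type(s).__name__, s.shape)   # Series (5,)
print(type(d).__name__, d.shape)   # DataFrame (5, 1)
```

The values are the same either way; what changes is the container, which matters when a function expects 2-dimensional input.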
Basically, I have an Excel file that contains thousands of rows of data, and I'm using pandas to read in the file.
import pandas as pd
agg = pd.read_csv('Station.csv', sep = ',')
Then I grouped the data according to these categories:
month_station = agg.groupby(['month','StationName'])
The groupby will not be used for computing the mean, median, etc., but just for grouping the data by month and station name. That is what the question asks for.
Now I want to output month_station to an Excel file, so first I need to convert the groupby into a DataFrame.
I've seen examples:
pd.DataFrame(month_station.size().reset_index(name = "Group_Count"))
but the thing is, I don't need the size/count of my data, just the data grouped by month and station name, which does not require counting or sorting. I tried removing the size() and it gives me an error.
I just want the content of month_station to be ported into a DataFrame so I can proceed and output it as a csv file, but it seems complicated.
The nature of groupby is that it lets you derive an aggregate calculation, such as a mean, count, or sum. If you are merely trying to see one of each pair of month and station name, try this:
month_station = agg.groupby(['month','StationName'],as_index=False).count()
month_station = month_station[['month','StationName']]
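If no aggregate is needed at all, the groupby can arguably be skipped. The sketch below is one alternative, not the only way to do it: sorting keeps every row while placing each (month, StationName) group together, and drop_duplicates lists just the distinct pairs. The sample data and the `value` column are made up.

```python
import pandas as pd

# Made-up data standing in for Station.csv
agg = pd.DataFrame({
    'month': [2, 1, 1, 2, 1],
    'StationName': ['B', 'A', 'B', 'B', 'A'],
    'value': [5.0, 1.0, 2.0, 6.0, 3.0],
})

# All rows, arranged so each (month, StationName) group is contiguous
arranged = agg.sort_values(['month', 'StationName']).reset_index(drop=True)

# Just the distinct (month, StationName) pairs
pairs = agg[['month', 'StationName']].drop_duplicates().reset_index(drop=True)

print(arranged)
print(pairs)
```

Either result is an ordinary DataFrame, so it can be written out with to_csv directly.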
I have a csv file with 8 columns in it. I want to plot one column against another using matplotlib. One of the columns has repeated values. For each repeated value in that column, I want to take the mean of the corresponding values in the other column.
How can I do it?
This isn't really specific to matplotlib. Pandas has nice support for this kind of data mangling. Read your csv file into a Pandas dataframe:
import pandas as pd
df = pd.read_csv('data.csv')
Then, assuming the column you want to group by is named 'key' and the column whose values you want to take means of is named 'value', you can do:
grouped = df.groupby('key')['value'].mean()
grouped.plot()
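A quick check of the grouping step on made-up data; the column names 'key' and 'value' are the placeholders assumed above, and plotting is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Made-up data: 'key' repeats, 'value' is averaged within each key
df = pd.DataFrame({
    'key': [1, 1, 2, 2, 2],
    'value': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Selecting 'value' before mean() avoids averaging unrelated columns
means = df.groupby('key')['value'].mean()
print(means)
```

The result is a Series indexed by 'key', so plotting it puts the keys on the x-axis and the means on the y-axis.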
I am using pandas and python to process multiple files with different column names for columns with the same data.
dataset = pd.read_csv('Test.csv', index_col=0)
cols= dataset.columns
I have the different possible column titles in a list.
AddressCol=['sAddress','address','Adrs', 'cAddress']
Is there a way to normalize all the possible column names to "Address" in pandas so I can use the script on different files?
Without pandas I would use something like a double for loop to go through the list of column names and the possible column names, with an if statement to extract the whole array.
You can use the rename DataFrame method:
dataset.rename(columns={typo: 'Address' for typo in AddressCol}, inplace=True)
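A small sketch of how this behaves, using a made-up frame in place of Test.csv (the `name` column is hypothetical). Keys in the mapping that do not match any column are ignored by default, so the same mapping can be applied to every file.

```python
import pandas as pd

AddressCol = ['sAddress', 'address', 'Adrs', 'cAddress']

# Made-up data: this file happens to use the 'Adrs' spelling
dataset = pd.DataFrame({'name': ['Ann'], 'Adrs': ['1 Main St']})

# Map every known variant to 'Address'; variants not present are skipped
dataset = dataset.rename(columns={typo: 'Address' for typo in AddressCol})

print(list(dataset.columns))  # ['name', 'Address']
```

If a file contained two of the variants at once, both would be renamed to 'Address' and the frame would end up with duplicate column names, so that case would need separate handling.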