I have a dataset with thousands of rows, with a column 'State' and some other columns.
A sample dataset
import pandas as pd
data = {'State': ['C','C','C','R','R','D','D','R','C','C','R','D','R','C','R','D','R'],
        'Qd': [3, 2, 1, 0, 2, 2, 5, 7, 9, 7, 14, 34, 12, 10, 11, 14, 15]}
df = pd.DataFrame.from_dict(data)
The 'State' column has a repeating pattern of values like 'C,R,D,R', then 'C,R,D,R' again, and so on. I want to split the dataset into several dataframes like
df1
df2
df3
Each dataframe will contain one complete loop from the State column. How can I do this?
I am thinking of creating a list of dataframes and using a for loop to store each loop's rows in its own dataframe.
The following lines assume that items contains the set of periodic state values and that data0 is the pandas dataframe prior to any splitting. All state values are assumed to follow the order declared in items.
import pandas as pd

items = ['C', 'R', 'D', 'R']          # one complete loop of states
pattern_length = len(items)

count = 0                             # state transitions seen in the current loop
current_state = data0.State.iloc[0]
dataframes = []                       # collects one dataframe per completed loop
temp_df = pd.DataFrame()
df_count = 0

for index, row in data0.iterrows():
    if current_state != row.State:
        count = count + 1
        current_state = row.State
        if count == pattern_length:   # a full C,R,D,R loop has just ended
            dataframes.append(temp_df)
            temp_df = pd.DataFrame()
            count = 0
            df_count = df_count + 1
    # DataFrame.append was removed in pandas 2.0; concatenate the row instead
    temp_df = pd.concat([temp_df, row.to_frame().T], ignore_index=True)

dataframes.append(temp_df)            # the final, still-open loop
df_count = df_count + 1
Note that dataframes[0], dataframes[1], and so on are the dataframes after the split. Also, df_count gives the total number of dataframes created.
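If each loop is known to begin with 'C' (as in the sample data), a shorter, vectorized sketch is also possible: mark every row where 'C' follows a non-'C' state, take a cumulative sum to get a loop id, and group by it. This is an alternative approach, not the answer above:

import pandas as pd

data = {'State': ['C','C','C','R','R','D','D','R','C','C','R','D','R','C','R','D','R'],
        'Qd': [3, 2, 1, 0, 2, 2, 5, 7, 9, 7, 14, 34, 12, 10, 11, 14, 15]}
df = pd.DataFrame(data)

# A new loop starts whenever 'C' appears after a non-'C' state (or at the top).
loop_id = ((df['State'] == 'C') & (df['State'].shift() != 'C')).cumsum()

# One dataframe per complete loop.
dataframes = [g.reset_index(drop=True) for _, g in df.groupby(loop_id)]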
I have two dataframes as shown below.
import databricks.koalas as ks
input_data = ks.DataFrame({'code':['123a', '345b', '678c'],
'id':[1, 2, 3]})
my_data = ks.DataFrame({'code':['123a', '12a', '678c'],
'id':[7, 8, 9], 'stype':['A', 'E', '5']})
Both dataframes have a column called code. I want to find the values in code that exist in both my_data and input_data and store the matching rows in a resulting dataframe called output. The output dataframe should contain only the code values that are present in input_data. The number of columns in each dataframe can differ; I have just shown a sample here.
Based on the sample provided in this question, the output dataframe would contain the following:
display(output)
# Result is below
code   id
123a    7
I found solutions online that mostly use for loops but I was wondering if there is a more efficient way to approach this.
Thank you all!
You can try using an inner merge on the two dataframes and then keep just the columns you want from the merged result.
For example,
# input_data and my_data both have an 'id' column; keep only 'code' from
# input_data so my_data's 'id' isn't renamed to 'id_x' in the merge result.
df_new = my_data.merge(input_data[['code']], on='code')
df_new = df_new[['code', 'id']]
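If you prefer to avoid the merge, an alternative sketch is to filter my_data with isin. This assumes koalas supports the pandas-style to_pandas/tolist/isin calls used here, which it generally does:

import databricks.koalas as ks

input_data = ks.DataFrame({'code': ['123a', '345b', '678c'], 'id': [1, 2, 3]})
my_data = ks.DataFrame({'code': ['123a', '12a', '678c'],
                        'id': [7, 8, 9], 'stype': ['A', 'E', '5']})

# Collect the codes present in input_data to the driver, then filter my_data.
valid_codes = input_data['code'].to_pandas().tolist()
output = my_data[my_data['code'].isin(valid_codes)][['code', 'id']]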
I want to create about 10 dataframes with the same number of rows and columns, which I want to specify.
Currently I am creating a dataframe with the specified rows and then using pd.concat to add columns to it, so I have to write 10 lines of code separately, one for each dataframe. Is there a way to do it in one go for all the dataframes? Say all the dataframes have 15 rows and 50 columns.
Also, I don't want to use a loop. All values in the dataframes are NaN, and I want to apply a different function to each dataframe, so editing one dataframe shouldn't change the values of the others.
You can simply create a numpy array of np.nan and then build a dataframe from it:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((15, 50), np.nan))
To create 10 dataframes, you can run this in a loop and append each one to a list.
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.full((15, 50), np.nan)))
Then you can index into dfs and change any value; it won't impact any other dataframe.
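Equivalently, a list comprehension builds all 10 frames in one statement; each iteration still creates an independent dataframe:

import numpy as np
import pandas as pd

dfs = [pd.DataFrame(np.full((15, 50), np.nan)) for _ in range(10)]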
You could do something like this:
index_list = range(10)
column_list = ['a', 'b', 'c', 'd']

for i in range(5):
    locals()["df_" + str(i)] = pd.DataFrame(index=index_list, columns=column_list)
This will create 5 different dataframes (df_0 to df_4), each with 10 rows and 4 columns named a, b, c, d, with all values NaN.
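As a side note, a plain dict keyed by name is usually preferred over writing into locals(); here is a sketch of the same idea using the names from the answer above:

import pandas as pd

index_list = range(10)
column_list = ['a', 'b', 'c', 'd']

# A dict keyed by name keeps the frames together and avoids locals() tricks.
frames = {f"df_{i}": pd.DataFrame(index=index_list, columns=column_list)
          for i in range(5)}

frames["df_0"].loc[0, 'a'] = 1.0  # editing one frame leaves the others untouched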
import pandas as pd

row_num = 15
col_num = 50

# column labels 0 .. col_num - 1
temp = list(range(col_num))

# creation of the dataframe (unfilled cells default to NaN)
df = pd.DataFrame(index=range(row_num), columns=temp)
This code creates a single dataframe in pandas with the specified numbers of rows and columns. But without a loop or some other form of iteration, the same lines must be written for each additional dataframe.
Note: this is a pure pandas implementation. A GitHub gist can be found here.
Suppose I had data with 12 columns; the following would get me all 12 columns.
import numpy as np
import pandas as pd
from io import StringIO

# train_data starts as a raw CSV string and is parsed into a NumPy array
train_data = np.asarray(pd.read_csv(StringIO(train_data), sep=',', header=None))
inputs = train_data[:, :12]
However, let's say I want a subset of these columns (not all of them).
If I had a list
a=[1,5,7,10]
is there a smart way I can pass a so that I get a new dataframe whose columns reflect the entries of a, i.e. the first column of the new dataframe is column 1 of the big dataframe, the next column is column 5 of the big dataframe, and so on?
Thank you.
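A minimal sketch of one way to do this, assuming a holds 0-based column positions (the sample train_data values below are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed training data.
train_data = np.arange(36).reshape(3, 12)
a = [1, 5, 7, 10]

# NumPy fancy indexing: picks exactly the columns listed in a, in order.
inputs = train_data[:, a]

# The same selection if the data is kept as a pandas DataFrame.
df = pd.DataFrame(train_data)
inputs_df = df.iloc[:, a]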
I am trying to read a big CSV and then split it into smaller CSV files, based on the unique values in the column team.
At first I created a new dataframe for each team, so that a new .txt file was generated for each unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However, I want to start from a single dataframe, read all the unique teams, and create a .txt file for each team, with headers.
Is it possible?
pandas.DataFrame.groupby, when iterated without an aggregation, yields the group name and the sub-dataframe associated with each unique value in the groupby column.
The following code will create a file for the data associated with each unique value in the column used to group by.
Use f-strings to create a unique filename for each group.
import pandas as pd
# create the dataframe
df = pd.read_csv('combined.csv')
# groupby the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
    # save the sub-dataframe for each group to a tab-separated .txt file
    dataframe.to_csv(f'{group}.txt', sep='\t', index=False)
I have the following dataset, which I am reading from a CSV file.
x = [1, 2, 3, 4, 5]
With pandas I can access the column as
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
Since both appear to produce the same result, I can see the sense of the former but not of the latter. Could you please explain the difference between them and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's as if you're telling pandas "give me all the columns from the following list" and handing it a list with one column in it. It will filter your df, returning all columns in your list (in this case, a dataframe with only 1 column).
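A quick check of the return types makes the difference concrete (df_train here is a small stand-in built from the question's sample column):

import pandas as pd

df_train = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

print(type(df_train['x']))    # <class 'pandas.core.series.Series'>
print(type(df_train[['x']]))  # <class 'pandas.core.frame.DataFrame'>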
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers