Pandas adding multiple null data frames - python

I want to create about 10 data frames with the same number of rows and columns, both of which I want to specify.
Currently I am creating a df with the specific rows and then using pd.concat to add columns to the data frame. I am having to write 10 separate lines of code, one for each data frame. Is there a way to do it in one go for all the data frames? Say all the data frames have 15 rows and 50 columns.
Also, I don't want to use a loop. All values in the data frames are NaN, and I want to perform a different function on each data frame, so editing one data frame shouldn't change the values of the others.

You can simply create a numpy array filled with np.nan, and then create a dataframe from it:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.full((15, 50), np.nan))
To create 10 dataframes, you can run this in a loop and add each one to a list.
dfs = []
for i in range(10):
    # each iteration allocates a new array, so the frames share no data
    dfs.append(pd.DataFrame(np.full((15, 50), np.nan)))
Then you can index into dfs and change any value accordingly. It won't impact any other dataframe.
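As a minimal sketch of the same idea in a single statement, a list comprehension can build all the frames at once (it is still iteration under the hood, just without an explicit for block). The names n_frames, n_rows and n_cols are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

n_frames, n_rows, n_cols = 10, 15, 50  # illustrative names, values from the question

# Each frame is built from its own freshly allocated array, so no data is shared.
dfs = [pd.DataFrame(np.full((n_rows, n_cols), np.nan)) for _ in range(n_frames)]

# Editing one frame leaves the others untouched.
dfs[0].iloc[0, 0] = 1.0
print(dfs[0].iloc[0, 0])          # the value just written
print(dfs[1].isna().all().all())  # whether the second frame is still entirely NaN
```

Avoid building the list with [frame] * 10, because that would repeat one and the same object ten times instead of creating ten independent frames.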

You could do something like this:
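import pandas as pd
index_list = range(10)
column_list = ['a','b','c','d']
for i in range(5):
    locals()["df_" + str(i)] = pd.DataFrame(index=index_list, columns=column_list)
This will create 5 different dataframes (df_0 to df_4), each with 10 rows and 4 columns named a, b, c, d, with all values NaN. Note that assigning through locals() only works reliably at module level; inside a function the new names are not guaranteed to become local variables.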
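A dictionary keyed by name is a common alternative to generating variable names dynamically. The following is only a sketch under that assumption; the dictionary name frames and the key pattern "df_0" to "df_4" are illustrative choices.

```python
import pandas as pd

index_list = range(10)
column_list = ['a', 'b', 'c', 'd']

# Keys "df_0" .. "df_4" are illustrative; any hashable labels would do.
frames = {
    f"df_{i}": pd.DataFrame(index=index_list, columns=column_list)
    for i in range(5)
}

# Each value is a separate DataFrame, so this write affects only df_2.
frames["df_2"].loc[0, 'a'] = 42
print(sorted(frames))
print(frames["df_3"].isna().all().all())
```

Looking a frame up as frames["df_2"] behaves the same inside functions and at module level, which is not true of names injected through locals().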

import pandas as pd
row_num = 15
col_num = 50
# Column labels 0 .. col_num - 1
temp = list(range(col_num))
# Create the DataFrame; with no data supplied, every cell is NaN
df = pd.DataFrame(index=range(row_num), columns=temp)
This code creates a single data frame with the specified number of rows and columns. Without a loop or some other form of iteration, the same line must be written once for each additional data frame.
Note: this is a pure pandas implementation.

Related

Adding whole lines of a dataframe via a for loop

I had code as follows to collect interesting rows into a new dataframe:
df = df1.iloc[[66,113,231,51,152,122,185,179,114,169,97][:]]
but I want to use a for loop to collect the data. I have read that I need to combine the data as a list and then create the dataframe, but all the examples I have seen are for numbers and I can't create the same for each line of a dataframe. At the moment I have the following:
data = ['A','B','C','D','E']
for n in range(10):
    data.append(dict(zip(df1.iloc[n, 4])))
df = pd.Dataframe(data)
(P.S. I have 4 in the code because I want the data to be selected via column E and the dataframe is already sorted so I am just looking for the first 10 rows)
Thanks in advance for your help.
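One possible way to collect whole rows in a loop is sketched below. It assumes df1 is already sorted as the question describes; the small df1 built here is a hypothetical stand-in with columns A to E, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for df1: 20 rows, columns A..E.
df1 = pd.DataFrame({c: range(i, i + 20) for i, c in enumerate("ABCDE")})

rows = []
for n in range(10):
    rows.append(df1.iloc[n])   # each item is a Series holding one full row

df = pd.DataFrame(rows)        # one row per collected Series
print(df.shape)
```

If no per-row logic is needed inside the loop, df1.iloc[:10] selects the same ten rows directly.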

How can I split dataset into several dataframe in python?

I have a dataset which has thousands of rows with a column 'State' and some other columns.
A sample dataset
import pandas as pd
data = {'State':['C','C','C','R','R','D','D','R','C','C','R','D','R','C','R','D','R'],
        'Qd': [3, 2, 1, 0,2,2,5,7,9,7,14,34,12,10,11,14,15]}
df = pd.DataFrame.from_dict(data)
The 'State' column cycles through a repeating pattern such as 'C,R,D,R', followed by 'C,R,D,R' again. I want to split the dataset into several dataframes like
df1
df2
df3
Each dataframe will contain one complete loop from the State column. How can I do this?
I am thinking of creating a list of dataframes and using a for loop to store each loop in its own dataframe.
The following lines assume that items contains the set of periodic state values, that data0 is a pandas data frame prior to any splitting, and that the state values follow the order declared in the list items.
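import pandas as pd
items = ['C', 'R', 'D', 'R']
pattern_length = len(items)
count = 0
current_state = data0.State.iloc[0]
dataframes = list()
temp_rows = []  # rows of the loop currently being collected
df_count = 0
for index, row in data0.iterrows():
    if current_state != row.State:
        # the state changed: one more step of the pattern is complete
        count = count + 1
        current_state = row.State
        if count == pattern_length:
            # a full loop has been seen: store it and start a new one
            dataframes.append(pd.DataFrame(temp_rows).reset_index(drop=True))
            temp_rows = []
            count = 0
            df_count = df_count + 1
    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list
    temp_rows.append(row)
dataframes.append(pd.DataFrame(temp_rows).reset_index(drop=True))
df_count = df_count + 1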
Note that dataframes[0], dataframes[1] and so on are the data frames after the split. Also, df_count gives the total number of data frames created.
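The same split can also be sketched without row-by-row iteration. This version is a swapped-in technique (a cumulative sum used as a group key), not the answer's method, and it assumes that every loop begins with 'C' and that 'C' appears nowhere else in the pattern. The names starts and loop_id are illustrative.

```python
import pandas as pd

data = {'State': ['C','C','C','R','R','D','D','R','C','C','R','D','R','C','R','D','R'],
        'Qd': [3, 2, 1, 0, 2, 2, 5, 7, 9, 7, 14, 34, 12, 10, 11, 14, 15]}
df = pd.DataFrame(data)

# A new loop starts wherever 'C' follows a different state (or at the first row).
starts = (df['State'] == 'C') & (df['State'].shift() != 'C')

# The running count of loop starts labels every row with the loop it belongs to.
loop_id = starts.cumsum()

# One data frame per loop, each with a fresh 0-based index.
dataframes = [g.reset_index(drop=True) for _, g in df.groupby(loop_id)]
print(len(dataframes))
```

If the pattern had a different first state, or a state that repeats at the start, the starts condition would need to be adapted.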

Creating dataframes from others

I would need to filter multiple data frames and create new data frames based on them.
The data frames are accessed as df[str(i)], i.e. df["0"], df["1"], and so on.
I would need, after filtering the rows, to create new dataframes. I am trying as follows:
n=5
for i in range(0, n):
    filtered = df[str(i)]
but it returns at the end only the latest dataframe created, i.e. n=5.
I have tried also with filtered[str(i)] but it gives me the error "n".
What I would like to have is:
filtered["0"] for df["0"]
filtered["1"] for df["1"]
...
I would appreciate your help to figure it out. Thanks
You could append your filtered dataframes to a list, then concatenate into a new dataframe.
import pandas as pd
n=5
dfs = []
for i in range(n):
    filtered = df[str(i)]
    dfs.append(filtered)
df_filtered = pd.concat(dfs)
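If the goal is to keep one filtered frame per key, as in filtered["0"] for df["0"], a dictionary may fit better than concatenation. The sketch below rests on assumptions: the input df is a hypothetical dict of small frames, and the condition on column "x" is a placeholder for the real filter.

```python
import pandas as pd

# Hypothetical input: a dict of data frames keyed "0".."4", each with a column "x".
df = {str(i): pd.DataFrame({"x": range(i, i + 5)}) for i in range(5)}

n = 5
filtered = {}
for i in range(n):
    frame = df[str(i)]
    # Placeholder condition; substitute the real filter here.
    filtered[str(i)] = frame[frame["x"] > 3]

print(filtered["0"])
```

Assigning to filtered[str(i)] stores a separate result per key, whereas assigning to a plain variable filtered inside the loop keeps only the last one.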

How to get data correctly into dictionary

I've read a csv file into a pandas data frame, df, of 84 rows. There are n (6 in this example) values in a column that I want to use as keys in a dictionary, data, to convert to a data frame df_data. Column names in df_data come from the columns in df.
I can do most of this successfully, but I'm not getting the actual data into the dataframe. I suspect the problem is in my loop creating the dictionary, but can't figure out what's wrong.
I've tried subsetting df[cols], taking it out of a list, etc.
data = {}
cols = [x for x in df.columns if x not in drops] # drops is list of unneeded columns
for uni in unique_sscs: # unique_sscs is a list of the values to use as the index
    for col in cols:
        data[uni] = [df[cols]]
df_data = pd.DataFrame(data, index=unique_sscs, columns=cols)
Here's my result (they didn't paste, but all values show as NaN in Jupyter):
lab_anl_method_name analysis_date test_type result_type_code result_unit lab_name sample_date work_order sample_id
1904050740
1904050820
1904050825
1904050830
1904050840
1904050845

Select 2 ranges of columns to load - read_csv in pandas

I'm reading in an excel .csv file using pandas.read_csv(). I want to read in 2 separate column ranges of the excel spreadsheet, e.g. columns A:D AND H:J, to appear in the final DataFrame. I know I can do it once the file has been loaded in using indexing, but can I specify 2 ranges of columns to load in?
I've tried something like this....
usecols=[0:3,7:9]
I know I could list each column number individually, e.g.
usecols=[0,1,2,3,7,8,9]
but I have simplified the file in question, in my real file I have a large number of rows so I need to be able to select 2 large ranges to read in...
I'm not sure if there's an official-pretty-pandaic-way to do it with pandas.
But you can do it this way:
# say you want to extract 2 ranges of columns
# columns 5 to 14
# and columns 30 to 66
import pandas as pd
range1 = list(range(5, 15))
range2 = list(range(30, 67))
usecols = range1 + range2
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=usecols)
As @jezrael notes, you can use numpy.r_ to do this in a more pythonic and legible way:
import pandas as pd
import numpy as np
file_name = 'path/to/csv/file.csv'
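# slices in np.r_ are half-open: 0:4 gives 0, 1, 2, 3 and 7:10 gives 7, 8, 9 (columns A:D and H:J)
df = pd.read_csv(file_name, usecols=np.r_[0:4, 7:10])
Gotchas
np.r_ slices exclude their end point, like range: 0:3 yields 0, 1, 2 (3 items). When using this in combination with names, also check that you have allowed for an index column if the CSV has one (for example, a file written by DataFrame.to_csv with its default index=True). That column occupies position 0, so CSV columns 1, 2, 3 plus the index need 0:4 (4 items).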
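To check which positions a given np.r_ expression selects before pointing it at a large file, it can be tried on a small in-memory CSV. The 12-column text below is a hypothetical stand-in for the real file; np.r_ slices exclude their end point, so 0:4 and 7:10 are used here to cover positions 0 to 3 and 7 to 9.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 12-column CSV (c0 .. c11) standing in for the real file.
header = ",".join(f"c{i}" for i in range(12))
row = ",".join(str(i) for i in range(12))
csv_text = f"{header}\n{row}\n{row}\n"

# Two half-open ranges: positions 0, 1, 2, 3 and 7, 8, 9.
usecols = np.r_[0:4, 7:10]

df = pd.read_csv(io.StringIO(csv_text), usecols=usecols)
print(list(df.columns))
```

Printing usecols itself is a quick way to confirm the positions before loading the full file.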
