I have a pandas dataframe of 14 million rows in which column A holds lists of values.
I have another list:
code = ['q23', 'r45', 'y67']
I want to create a new column B in the dataframe that is True whenever any of the values in column A's list are in code.
any([True if x in df.A if x in code]) doesn't work (it isn't even valid syntax).
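A row-wise check with apply works here; a minimal sketch on a made-up three-row frame (using a set for code keeps each membership test O(1), which matters at 14 million rows):

```python
import pandas as pd

code = {'q23', 'r45', 'y67'}  # set for O(1) membership tests

df = pd.DataFrame({'A': [['q23', 'z1'], ['a2'], ['y67', 'r45']]})

# B is True when the row's list shares at least one value with code
df['B'] = df['A'].apply(lambda vals: any(v in code for v in vals))
```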
I want to create about 10 dataframes, each with the same number of rows and columns that I specify.
Currently I am creating a df with the specific rows and then using pd.concat to add columns to it. I have to write 10 lines of code separately, one for each dataframe. Is there a way to do it in one go for all the dataframes? Say all the dataframes have 15 rows and 50 columns.
Also I don't want to use a loop. All values in the dataframes are NaN, and I want to apply a different function to each dataframe, so editing one dataframe shouldn't change the values of the others.
You can simply create a NumPy array of np.nan and build a dataframe from it:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((15, 50), np.nan))
For creating 10 dataframes, you can just run this in a loop and append each one to a list:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.full((15, 50), np.nan)))
Then you can index into dfs and change any value accordingly. It won't impact any other dataframe.
You could do something like this:
index_list = range(10)
column_list = ['a','b','c','d']
for i in range(5):
    locals()["df_" + str(i)] = pd.DataFrame(index=index_list, columns=column_list)
This will create 5 different dataframes (df_0 to df_4), each with 10 rows and 4 columns named a, b, c, d, with all values as NaN.
import pandas as pd
row_num = 15
col_num = 50
temp = []
for col_name in range(col_num):
    temp.append(col_name)

# Creation of the DataFrame
df = pd.DataFrame(index=range(row_num), columns=temp)
This code creates a single pandas dataframe with the specified row and column counts. But without a loop or some other form of iteration, the same code must be written out separately for each dataframe.
Note: this is a pure pandas implementation.
Suppose I had data with 12 columns; the following would get me those 12 columns:
train_data = np.asarray(pd.read_csv(StringIO(train_data), sep=',', header=None))
inputs = train_data[:, :12]
However, let's say I want only a subset of these columns (not all of them).
If I had a list
a=[1,5,7,10]
is there a smart way I can pass "a" so that I get a new dataframe whose columns reflect the entries of "a", i.e. the first column of the new dataframe is column a[0] of the big dataframe, the next is column a[1], and so on?
Thank you.
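Yes: NumPy's fancy indexing accepts a list of column positions directly. A minimal sketch with stand-in data:

```python
import numpy as np

train_data = np.arange(24).reshape(2, 12)  # stand-in for the parsed CSV

a = [1, 5, 7, 10]
inputs = train_data[:, a]  # columns 1, 5, 7 and 10, in that order
```

The same list also works on a DataFrame via df.iloc[:, a], or at parse time via pd.read_csv(..., usecols=a).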
I've read a csv file into a pandas data frame, df, of 84 rows. There are n (6 in this example) values in a column that I want to use as keys in a dictionary, data, to convert to a data frame df_data. Column names in df_data come from the columns in df.
I can do most of this successfully, but I'm not getting the actual data into the dataframe. I suspect the problem is in my loop creating the dictionary, but can't figure out what's wrong.
I've tried subsetting df[cols], taking it out of a list, etc.
data = {}
cols = [x for x in df.columns if x not in drops]  # drops is a list of unneeded columns
for uni in unique_sscs:  # unique_sscs is a list of the values to use as the index
    for col in cols:
        data[uni] = [df[cols]]
df_data = pd.DataFrame(data, index=unique_sscs, columns=cols)
Here's my result (the values didn't paste, but they all show as NaN in Jupyter):
lab_anl_method_name analysis_date test_type result_type_code result_unit lab_name sample_date work_order sample_id
1904050740
1904050820
1904050825
1904050830
1904050840
1904050845
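The inner loop assigns the entire df[cols] frame to every key instead of that key's own row, so nothing lines up and pandas fills NaN everywhere. Assuming the unique values live in a column of df (called 'ssc' below as a placeholder), one fix is to pull out each matching row, shown on a toy two-row frame:

```python
import pandas as pd

# toy stand-in for the real 84-row frame; 'ssc' is a placeholder key column
df = pd.DataFrame({
    'ssc': [1904050740, 1904050820],
    'lab_name': ['Lab A', 'Lab B'],
    'result_unit': ['mg/L', 'ug/L'],
})
drops = ['ssc']
cols = [x for x in df.columns if x not in drops]
unique_sscs = df['ssc'].unique()

# one list of cell values per key, taken from that key's first matching row
data = {uni: df.loc[df['ssc'] == uni, cols].iloc[0].tolist()
        for uni in unique_sscs}
df_data = pd.DataFrame.from_dict(data, orient='index', columns=cols)
```

If each key appears exactly once, df.set_index('ssc')[cols] gives the same result without any loop.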
I have a csv file like the following:
#1, #2, #3, #4, #5, #6
I want to generate a data frame using the read_csv method, but how do I assign the values in the first 5 columns to a single column in my data frame as a list? And how could I apply a heading to that column?
Create a new DataFrame.
Supposing your previous DataFrame object is df=read_csv('something.csv')
df1 = pd.DataFrame(columns=['ColName1', 'ColName2'])  # new DataFrame with columns ColName1 and ColName2
c = 0  # index for the new DataFrame
for i in range(len(df)):
    df1.loc[c] = [None for x in range(2)]  # row elements initially null
    # collect the first five columns of row i into a single list
    df1.at[c, 'ColName1'] = [df.loc[i]['#' + str(j)] for j in range(1, 6)]
    df1.at[c, 'ColName2'] = df.loc[i]['#6']
    c += 1
df1.to_csv('new.csv')
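A loop-free alternative (assuming the same '#1'..'#6' headers): slice the first five columns and turn each row into a list with .values.tolist():

```python
import pandas as pd
from io import StringIO

csv = "#1,#2,#3,#4,#5,#6\n1,2,3,4,5,6\n7,8,9,10,11,12\n"
df = pd.read_csv(StringIO(csv))

df1 = pd.DataFrame({
    'ColName1': df[['#1', '#2', '#3', '#4', '#5']].values.tolist(),  # first five columns as one list per row
    'ColName2': df['#6'],
})
```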
I recently started working with pandas dataframes.
I have a list of dataframes called 'arr'.
Edit: All the dataframes in 'arr' have same columns but different data.
Also, I have an empty dataframe 'ndf' which I need to fill in using the above list.
How do I iterate through 'arr' to fill the max values of the columns of each dataframe in 'arr' into a row of 'ndf'?
So, we'll have
Number of rows in ndf = Number of elements in arr
I'm looking for something like this:
columns = ['time', 'Open', 'High', 'Low', 'Close']
ndf = pd.DataFrame(columns=columns)
ndf['High'] = arr[i].max(axis=0)
Based on your description, I assume a basic example of your data looks something like this:
import pandas as pd
data =[{'time':'2013-09-01','open':249,'high':254,'low':249,'close':250},
{'time':'2013-09-02','open':249,'high':256,'low':248,'close':250}]
data2 =[{'time':'2013-09-01','open':251,'high':253,'low':248,'close':250},
{'time':'2013-09-02','open':245,'high':251,'low':243,'close':247}]
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
arr = [df, df2]
If that's the case, then you can simply iterate over the list of dataframes (via enumerate()) and the columns of each dataframe (via items(), see http://pandas.pydata.org/pandas-docs/stable/basics.html#iteritems), populating each new row with a dictionary comprehension:
rows = []
for i, df in enumerate(arr):
    # one-row frame holding this dataframe's per-column maxima
    rows.append(pd.DataFrame({colName: [colData.max()] for colName, colData in df.items()}, index=[i]))
ndf = pd.concat(rows)
If some of your dataframes have any additional columns, the resulting dataframe ndf will have NaN entries in the relevant places.
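Since DataFrame.max() already returns a per-column Series, the whole loop can also collapse to a single constructor call, sketched on two tiny frames:

```python
import pandas as pd

df = pd.DataFrame({'open': [249, 249], 'high': [254, 256]})
df2 = pd.DataFrame({'open': [251, 245], 'high': [253, 251]})
arr = [df, df2]

# one row per dataframe, one column per shared column name
ndf = pd.DataFrame([d.max() for d in arr])
```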