Write two dataframes to two different columns in CSV - python

I converted two arrays into two dataframes and would like to write them to a CSV file as two separate columns. There are no common columns in the dataframes. I tried the solutions below, as well as others from Stack Exchange, but did not get the expected result. Solution 2 raises no error, but it puts all the data into one column. I am guessing the problem is with how the arrays are converted to dataframes? I basically want the Frequency and PSD values exported to CSV as two columns. How do I do that?
Solution 1:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_BP_frq['tmp'] = 1
df_BP_psd['tmp'] = 1
df_500 = pd.merge(df_BP_frq, df_BP_psd, on=['tmp'], how='outer')
df_500 = df_500.drop('tmp', axis=1)
Error: Unable to allocate 2.00 TiB for an array with shape (274870566961,) and data type int64
Solution 2:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = df_BP_frq.merge(df_BP_psd, left_on='Frequency', right_on='PSD', how='outer')
No Error.
Result: The PSD values are all 0 and are seen below the frequency values in the lower rows.
Solution 3:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = pd.merge(df_BP_frq, df_BP_psd, on='tmp').ix[:, ('Frequency','PSD')]
Error: KeyError: 'tmp'
Exporting to csv using:
df_500.to_csv("PSDvalues500.csv", index=False, sep=',', na_rep='N/A', encoding='utf-8')

You can store the arrays directly as columns of the dataframe. If both arrays have the same length, the following method works.
df_500 = pd.DataFrame()
df_500['Frequency'] = freq_BP[L_BP]
df_500['PSD'] = PSDclean_BP[L_BP]
If the arrays have different lengths, you can convert them to Series and then add them as columns in the following way. This fills the missing positions with NaN.
df_500 = pd.DataFrame()
df_500['Frequency'] = pd.Series(freq_BP[L_BP])
df_500['PSD'] = pd.Series(PSDclean_BP[L_BP])
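Putting it together with the export from the question, a minimal sketch (dummy arrays standing in for freq_BP[L_BP] and PSDclean_BP[L_BP], which are assumed to be 1-D) would be:
import numpy as np
import pandas as pd

# dummy stand-ins for freq_BP[L_BP] and PSDclean_BP[L_BP]
freq = np.linspace(0, 250, 501)
psd = np.random.rand(501)

df_500 = pd.DataFrame()
df_500['Frequency'] = freq
df_500['PSD'] = psd

df_500.to_csv("PSDvalues500.csv", index=False, na_rep='N/A', encoding='utf-8')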

From your question, what I understood is that you have two arrays, want to store them as separate columns of one dataframe, and save that dataframe to CSV with separate columns.
Creating two NumPy arrays of equal length:
import numpy as np
import pandas as pd
n1 = np.arange(2, 100, 0.01)
n2 = np.arange(3, 101, 0.01)
Creating an empty dataframe and storing the above arrays as its columns:
n = pd.DataFrame()
n['feq'] = n1
n['psd'] = n2
Storing into CSV:
n.to_csv(r"C:\...\dataframe.csv", index=False)
If the arrays have unequal lengths, convert them to Series and then store them in the empty dataframe.
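A minimal sketch of the unequal-length case; assigning the longer array first matters, because each new column is aligned on the existing index and NaN fills the gap (in the other order the extra values would be dropped):
import numpy as np
import pandas as pd

n1 = np.arange(0, 10, 1.0)   # 10 values
n2 = np.arange(0, 5, 1.0)    # 5 values

n = pd.DataFrame()
n['feq'] = pd.Series(n1)     # longer array first: sets the index to 0..9
n['psd'] = pd.Series(n2)     # shorter array: rows 5..9 become NaN

n.to_csv("dataframe.csv", index=False, na_rep='N/A')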

Related

Efficient way to append dataframes below each other

I have the following part of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames below of each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis=1, ignore_index=True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea of what I should try?
First of all, avoid using pd.concat to build the main dataframe inside a for loop, as repeated concatenation gets progressively slower (each iteration copies all of the accumulated data again). The problem you are facing is that pd.concat should receive a list of DataFrames, but you are passing [df_final, concat], which is a list containing two elements: one DataFrame and one list of Series. Ultimately, it seems you want to stack the DataFrames vertically, so axis should be 0, not 1.
Therefore, I suggest you to do the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis=1, ignore_index=True)
    df_final.append(concat)
df_final = pd.concat(df_final, axis=0, ignore_index=True)
return df_final
Note that pd.concat receives a list of DataFrames, not a list that contains another list inside of it! Also, this approach is much faster, since the pd.concat inside the loop only ever works on the current batch instead of a result that grows every iteration :)
I hope it helps!
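As a quick, self-contained illustration of the pattern (toy batches standing in for chunk(df, n), and only two of the four statistics to keep it short):
import pandas as pd

batches = [
    pd.DataFrame({'clientip': ['a', 'a', 'b'], 'bytes': [10, 20, 30]}),
    pd.DataFrame({'clientip': ['b', 'c', 'c'], 'bytes': [5, 15, 25]}),
]

parts = []
for batch in batches:
    stats = pd.concat(
        [batch.groupby('clientip')['clientip'].count(),
         batch.groupby('clientip')['bytes'].mean()],
        axis=1)
    stats.columns = ['unique_request', 'reply_length_avg']
    parts.append(stats)                # append to a plain list inside the loop

df_final = pd.concat(parts, axis=0)    # a single concat after the loop
print(df_final)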

How to iterate over every two columns of a data frame in python using a function?

There is an Excel file that I have to read, process every two columns of, and eventually concatenate vertically into a single two-column dataframe.
I wrote the process for columns 0 and 1.
Now I am struggling to write a function that does the same for every pair of columns.
I first selected the first two columns from Excel as below:
data1 = pd.read_excel('Book2.xlsx', usecols=[0,1],parse_dates=True)
How can I generate data1 to data5 from columns (0,1), (2,3), (4,5), (6,7) and then do pd.concat([data1, data2, data3, data4, data5])?
I can't share the Excel file; however, it looks like:
stocks = [('2021-01-04', 113.4377, '2021-01-04', 'Nan'),
          ('2021-01-07', 125.8316, '2021-01-07', 127.8212),
          ('2021-01-14', 108.4792, '2021-01-14', 111.0318),
          ('2021-01-21', 99.584, '2021-01-21', 144.6129),
          ]
df = pd.DataFrame(stocks, columns=['DateA', 'StockA', 'DateB', 'StockB'])
df
You can read the entire dataset, loop over the column pairs, and save each two-column slice in a dictionary:
data = pd.read_excel('Book2.xlsx', parse_dates=True)
d = {}
for i in range(5):
    d['data_' + str(i+1)] = data[data.columns[i*2:i*2+2]]
And finally concat all the datasets:
result = pd.concat(d.values())
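Here is a self-contained sketch of the same idea, using the sample frame from the question in place of the Excel file; renaming each column pair to shared names is an assumption about the desired output, so that the vertical concat really stacks everything into two columns:
import pandas as pd

df = pd.DataFrame(
    [('2021-01-04', 113.4377, '2021-01-04', None),
     ('2021-01-07', 125.8316, '2021-01-07', 127.8212)],
    columns=['DateA', 'StockA', 'DateB', 'StockB'])

pairs = []
for i in range(0, df.shape[1], 2):
    pair = df.iloc[:, i:i + 2].copy()
    pair.columns = ['Date', 'Stock']   # shared names so the pieces line up when stacked
    pairs.append(pair)

result = pd.concat(pairs, ignore_index=True)
print(result)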

Loop through a list of dataframes and Merge them by index

I have a list of 982 dataframes and would like to loop through it so I can merge them by index. I intend to merge the dataframe in position [0] with the dataframe in position [1], then the dataframe in position [2] with the result of merging [0] and [1], and so on.
I tried this but it didn't seem to work:
aux4 = '/Users/lucasiancsamuels/Desktop/Res. Regional - COVID 19/Bases/Auxílio Emergencial/202004_AuxilioEmergencial.csv'
auxabr = pd.read_csv(aux4,chunksize=50000,encoding='Latin-1',sep=';')
chunk_list = []
#dividing the dataframe in chunks
for chunks in auxabr:
    chunks.drop(chunks.columns[[4,5,6,7,8,9,10,11,12]], inplace=True, axis=1)
    chunks.dropna(axis=0, inplace=True)
    agrupado1 = chunks.groupby('CÓDIGO MUNICÍPIO IBGE')
    auxemer1 = agrupado1['VALOR BENEFÍCIO']
    valor1 = auxemer1.agg(np.sum)
    chunks = chunks.drop_duplicates('CÓDIGO MUNICÍPIO IBGE')
    chunks.index = chunks['CÓDIGO MUNICÍPIO IBGE']
    chunks.index.astype(dtype=np.int64)
    chunks.sort_index(inplace=True)
    filtered_chunk = pd.concat([chunks, valor1], axis=1)
    chunk_list.append(filtered_chunk)
#merge the dataframes by index - didn't work
for i in range(0, 981):
    filtered_data = pd.merge(left=chunk_list[i], right=chunk_list[i+1], on=chunk_list[i].index)
And gives back this error:
KeyError: Float64Index([1200013.0, 1200054.0, 1200104.0, 1200138.0, 1200179.0,
1200203.0],
dtype='float64', name='CÓDIGO MUNICÍPIO IBGE')
Lucas, I started writing another comment but it was getting a bit too long for comfort.
Firstly, na_values doesn't do what you think it does. This option is used when you have additional values you want pandas to treat as NaN. For example, if my spreadsheets used -99 to mean "no value", I would instruct pandas to treat it as NA when loading the csv with na_values=-99.
What you need to do is: first load the csv as usual, then use fillna to replace the NaN values, and finally cast the entire column to integer:
auxabr = pd.read_csv(aux4, encoding='Latin-1', sep=';')  # read the full file; with chunksize you would apply the two steps below to each chunk
auxabr['CÓDIGO MUNICÍPIO IBGE'] = auxabr['CÓDIGO MUNICÍPIO IBGE'].fillna(0)
auxabr = auxabr.astype({'CÓDIGO MUNICÍPIO IBGE': 'int'})
It should all work after that.
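As a small illustration of the difference (toy data read from a string, not the real file), na_values only adds extra markers that pandas treats as NaN while parsing, whereas fillna plus astype is what actually produces an integer column:
import io
import pandas as pd

raw = io.StringIO("CÓDIGO MUNICÍPIO IBGE;VALOR BENEFÍCIO\n1200013;600\n-99;600\n;1200\n")

# na_values=-99 makes pandas treat -99 as NaN while reading
df = pd.read_csv(raw, sep=';', na_values=[-99])
print(df['CÓDIGO MUNICÍPIO IBGE'].dtype)   # float64, because the column still contains NaN

# fillna + astype is what turns it into a proper integer column
df['CÓDIGO MUNICÍPIO IBGE'] = df['CÓDIGO MUNICÍPIO IBGE'].fillna(0).astype('int64')
print(df['CÓDIGO MUNICÍPIO IBGE'].dtype)   # int64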

creating list from dataframe

I am a newbie to Python. I am trying to iterate over the rows of individual columns of a dataframe. I am trying to create an adjacency list using the first two columns of the dataframe, taken from CSV data (which has 3 columns).
The following is the code to iterate over the dataframe and create a dictionary for adjacency list:
df1 = pd.read_csv('person_knows_person_0_0_sample.csv', sep=',', index_col=False, skiprows=1)
src_list = list(df1.iloc[:, 0:1])
tgt_list = list(df1.iloc[:, 1:2])
adj_list = {}
for src in src_list:
    for tgt in tgt_list:
        adj_list[src] = tgt
print(src_list)
print(tgt_list)
print(adj_list)
and the following is the output I am getting:
['933']
['4139']
{'933': '4139'}
I see that I am not getting the entire list when I use the list() constructor.
Hence I am not able to loop over the entire data.
Could anyone tell me where I am going wrong?
To summarize, here is the input data:
A,B,C
933,4139,20100313073721718
933,6597069777240,20100920094243187
933,10995116284808,20110102064341955
933,32985348833579,20120907011130195
933,32985348838375,20120717080449463
1129,1242,20100202163844119
1129,2199023262543,20100331220757321
1129,6597069771886,20100724111548162
1129,6597069776731,20100804033836982
the output that I am expecting:
933: [4139,6597069777240, 10995116284808, 32985348833579, 32985348838375]
1129: [1242, 2199023262543, 6597069771886, 6597069776731]
Use groupby and create Series of lists and then to_dict:
# selecting by column names
d = df1.groupby('A')['B'].apply(list).to_dict()
# selecting columns by position
d = df1.iloc[:, 1].groupby(df1.iloc[:, 0]).apply(list).to_dict()
print(d)
{933: [4139, 6597069777240, 10995116284808, 32985348833579, 32985348838375],
1129: [1242, 2199023262543, 6597069771886, 6597069776731]}
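For what it's worth, the reason the original loop only produced one pair is that calling list() on a DataFrame (or on a one-column slice like df1.iloc[:, 0:1]) returns the column labels rather than the values, and skiprows=1 had turned the first data row ('933', '4139', ...) into the header. To get the actual values, take a Series and use tolist():
src_list = df1.iloc[:, 0].tolist()   # values of the first column
tgt_list = df1.iloc[:, 1].tolist()   # values of the second column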

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and reading and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the equation containing those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]
def f(A, B):
    return ...   # your equation here, written in terms of the whole columns
df[3] = f(A, B)
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # This is column 0 selected by position, returned as a Series
test.columns = ['S','Q'] # Column names are easier to use
test #Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(df):
    return df['S'] - df['Q']**2   # each row arrives as a Series indexed by the column names
test['out2']=test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))
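On the NumPy remark in the question: column-wise pandas expressions like the ones above are already vectorized under the hood, but if you prefer to work on the raw arrays explicitly, a minimal sketch (with made-up column names m, n and p) would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'m': [1.0, 2.0, 3.0], 'n': [4.0, 5.0, 6.0]})

m = df['m'].to_numpy()
n = df['n'].to_numpy()

# whole-column computation, no Python-level loop over the rows
df['p'] = np.sqrt(m**2 + n**2)
print(df)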
