I have the following two dataframes:
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/digit-recognition/train_data.csv')
data_custom = pd.read_csv('/content/drive/My Drive/Colab Notebooks/digit-recognition/custom-data.csv',header=None)
I want to train my KNN model on the combined data. Is there a way to combine these two dataframes? A normal merge may not work directly, as the column headers are present in one but not the other, although their structure is exactly the same.
custom-data.csv file
https://drive.google.com/file/d/1Qj-zfWoaYbMMbEin1K0dFbFHfDFr_t85/view?usp=sharing
train_data.csv file
https://drive.google.com/file/d/1yDmKBt-boMfaF5LK2SN7MM8LeUZ1p6vD/view?usp=sharing
final_data = pd.concat([data, data_custom]) produces the following output: [screenshot omitted]
Here is a screenshot of the custom-data.csv file: [screenshot omitted]
And here is a screenshot of the top rows of train_data.csv: [screenshot omitted]
You could try pd.concat:
data.columns = data_custom.columns
final_data = pd.concat([data, data_custom])
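A fuller sketch of that idea (the column assignment works in either direction; this version keeps the named headers from train_data.csv, and assumes both files really do share the same 785-column layout; ignore_index=True just rebuilds a clean row index):
import pandas as pd
data = pd.read_csv('train_data.csv')
data_custom = pd.read_csv('custom-data.csv', header=None)
# give the header-less custom frame the same column labels as data
data_custom.columns = data.columns
# stack the rows and renumber the index 0..n-1
final_data = pd.concat([data, data_custom], ignore_index=True)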
You could definitely use concat, as @U12-Forward said.
Otherwise, take a look at this page, which shows the differences between concat, merge, join and append, depending on what you need to do.
import pandas as pd
data = pd.read_csv("train_data.csv")
data_custom = pd.read_csv('custom-data.csv', header=None)
# combining the data_custom df as a row at the end of the data df:
entry = data_custom.iloc[0].values
data.loc[len(data)] = entry
data shape at start: (42000, 785)
data shape after combining: (42001, 785)
The custom data is combined as another row.
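One caveat: iloc[0] grabs only the first row of data_custom. If custom-data.csv ever holds several rows, the same pattern extends with a loop (a small sketch, same 785-column layout assumed):
for _, row in data_custom.iterrows():
    data.loc[len(data)] = row.values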
I have a pandas dataframe with the following data: (in csv)
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output: (in csv)
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? I think re-indexing the whole dataframe of ~4000 entries just to add ~100 more is pointless and cumbersome. How can I make it pick the highest poke_id from list 1 (or dataframe 1) and simply continue numbering (i + 1) for the entries in list 2?
Your solution is good; it's possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
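If you'd rather not renumber the existing ~4000 rows at all, here is a minimal sketch of the "pick the highest poke_id and count on from there" idea from the question (assuming poke_id in df1 is numeric):
last_id = df1['poke_id'].max()
df2['poke_id'] = range(last_id + 1, last_id + 1 + len(df2))
df = pd.concat([df1, df2], ignore_index=True)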
Use indexes to get the data you want from the dataframe. This is not efficient when you want large amounts of data, but it lets you pull specific slices from the dataframe.
There is an Excel file that I have to read and do some processing on every two columns of it, eventually concatenating them vertically into a two-column dataframe.
I coded the process for columns 0 and 1; now I am struggling to write a function that does the same for every pair of columns.
I first chose the first two columns from the Excel file as below:
data1 = pd.read_excel('Book2.xlsx', usecols=[0,1],parse_dates=True)
How can I generate data1 to data5 from columns (0,1), (2,3), (4,5), (6,7) and then do pd.concat([data1, data2, data3, data4, data5])?
I can't share the Excel file; however, it looks like:
stocks = [('2021-01-04', 113.4377, '2021-01-04','Nan'),
('2021-01-07', 125.8316, '2021-01-07',127.8212),
('2021-01-14', 108.4792, '2021-01-14',111.0318),
('2021-01-21', 99.584, '2021-01-21',144.6129),
]
df = pd.DataFrame(stocks,columns =['DateA', 'StockA', 'DateB', 'StockB'])
df
You can read the entire dataset once, loop over its columns two at a time, and save each slice in a dictionary:
data = pd.read_excel('Book2.xlsx', parse_dates=True)
d = {}
for i in range(5):
    d['data_' + str(i+1)] = data[data.columns[i*2:i*2+2]]
And finally concat all the datasets (note that you concatenate the dataframes themselves, d.values(), not the dictionary keys):
result = pd.concat(d.values())
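Since each pair keeps its own headers (DateA/StockA, DateB/StockB, ...), concatenating vertically only collapses into two columns once the names are unified. A sketch of that step, assuming five Date/Stock pairs as in the question:
pairs = []
for i in range(5):
    pair = data.iloc[:, i*2:i*2+2].copy()
    pair.columns = ['Date', 'Stock']  # unify headers so the rows stack into two columns
    pairs.append(pair)
result = pd.concat(pairs, ignore_index=True)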
I have a data set in CSV that I read with pd.read_csv. I want to sort the existing data by descending value.
my code is this:
dataset = pd.read_csv('Kripto.csv')
sorted = dataset.sort_values(by = "Freq1", ascending=False)
x = dataset.iloc[:, :].values
and my data set (print(dataset)) is this:
Letter;Freq1
0 A;0.0817
1 B;0.0150
2 C;0.0278
3 D;0.0425
4 E;0.1270
When I want to use this code:
sorted = dataset.sort_values(by = "Freq1", ascending=False)
Python gives me an error: KeyError: 'Freq1'.
I know that "Freq1" is not the name of the column, but I have no idea how to assign a name.
Your CSV file has ';' as its separator; you need to indicate that in the read_csv method:
import pandas as pd
dataset = pd.read_csv('your.csv', sep=';')
And that's all you need to do
Your CSV file uses semicolons to separate values. Since Pandas by default expects commas, use
dataset = pd.read_csv('Kripto.csv', sep=';')
instead.
You should also use the sorted dataset to get your values in sorted order, instead of dataset, since the latter will remain unsorted:
x = sorted.iloc[:, :].values
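Putting both fixes together (a short sketch; the variable is named sorted_data so Python's built-in sorted() isn't shadowed):
import pandas as pd
dataset = pd.read_csv('Kripto.csv', sep=';')
sorted_data = dataset.sort_values(by='Freq1', ascending=False)
x = sorted_data.iloc[:, :].values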
I currently have a CSV that contains many rows (some 200k) with many columns each. I basically want a time-series training and test data split. I have many unique items inside my dataset, and I want the first 80% (chronologically) of each to be in the training data. I wrote the following code to do so:
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date', ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I adding them in wrong?
Your immediate issue is not re-assigning inside the for-loop:
train = train.append(trainvals)
test = test.append(testvals)
However, it is memory-inefficient to grow a data frame inside a loop (and DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0). Instead, consider iterating across groupby to build a list of dictionaries containing the test and train splits via a list comprehension, then call pd.concat to bind each set together. A defined method keeps the processing organized.
def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]
    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(g_df) for g, g_df in df.groupby('itemid')]

train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
You can also avoid the loop with groupby plus quantile. Note that df.groupby('itemid').quantile(0.8) on its own collapses each group into a single row of column quantiles, so its index cannot be matched back against df's rows. Applying the quantile to the Date column through transform keeps one cutoff per row instead:
cutoff = df.groupby('itemid')['Date'].transform(lambda d: d.quantile(0.8))
train = df[df['Date'] <= cutoff]
test = df[df['Date'] > cutoff]
Note this splits on the 80th percentile of dates, which can differ slightly from a strict first-80%-of-rows split when dates repeat.
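If the split must be exactly the first 80% of rows per item (matching the loop), a rank-based sketch along the same loop-free lines:
df = df.sort_values('Date')
pos = df.groupby('itemid').cumcount()                    # 0-based position within each item
size = df.groupby('itemid')['itemid'].transform('size')  # rows per item
train = df[pos < 0.8 * size]
test = df[pos >= 0.8 * size]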
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as output. Below is the code that I'm working with.
My document consists of the columns:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Specify here the column numbers you want to select (in a dataframe, columns start from index 0):
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be label-based slicing with loc. Note that loc slices by column name, not by spreadsheet letter, so use the actual header names:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']
Hope that helps.
This worked for me, using positional slicing with iloc (plain df[n1:n2] slices rows, not columns):
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are both column positions in the range; the end of the slice is exclusive, e.g. if you want columns 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though not sure how to select a discontinuous range of columns.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will spit out the top 3 rows of columns 2-3 (remember numbering starts at 0).
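The list form also covers the discontinuous case the earlier answer was unsure about; for example, every row of three non-adjacent columns (positions chosen only for illustration):
dataset2.iloc[:, [0, 2, 4]]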