How to convert text data into a DataFrame - python

How can I convert the text data below into a pandas DataFrame:
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this into a DataFrame with 4 rows and 5 columns. For example, the first row should contain the first element of each parenthesized group.
Thanks for your contribution.

Try this:
import pandas as pd
with open("file.txt") as f:
    file = f.read()
df = pd.DataFrame([{f"name{id}": val.replace("(", "").replace(")", "") for id, val in enumerate(row.split(",")) if val} for row in file.split()])

import re
import pandas as pd
with open('file.txt') as f:
    data = [re.findall(r'([\-\d.]+)', line) for line in f.readlines()]
df = pd.DataFrame(data).T.astype(float)
Output:
0 1 2 3 4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108 -1.680536 -10.128438 -9.415762
2 -7.832280 -3.280625 -4.147730 -2.968883 -3.348587
3 5.553141 -6.860671 -3.541440 -2.705747 -7.654747
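Since the text is itself valid Python tuple syntax, another option is to parse it with ast.literal_eval instead of a regular expression. The following is a minimal sketch, assuming the text has already been read into a string (the names text and tuples are illustrative, not from the answers above). Wrapping the text in square brackets lets the literal span several lines.

```python
import ast

import pandas as pd

# Sample text copied from the question; in practice it could come from f.read()
text = """(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)"""

# Surround the text with [] so it parses as one list of tuples,
# even though it is split over two lines
tuples = ast.literal_eval("[" + text + "]")

# Each tuple becomes a row; transpose so each tuple becomes a column
df = pd.DataFrame(tuples).T
print(df.shape)
```

literal_eval only accepts Python literals, so it does not execute arbitrary code from the file, and the values arrive as floats without a separate astype step.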

Your data is essentially a tuple of tuples, so you can pass it to pandas as a list of tuples and get a DataFrame out of it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you can see, pandas displays 6 decimal places by default while your values have up to 8, so you can set pd.options.display.float_format accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get the layout you want, transpose the DataFrame:
pd.DataFrame(list(text_data)).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
OR
You can also create the DataFrame directly from a bare tuple of tuples, or from a list of tuples as in the commented-out line:
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
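Note that display.float_format affects only how the values are printed, not what is stored. The sketch below illustrates this on a two-tuple subset of the data; it uses pd.option_context so the setting applies only inside the with block, which is a design choice of this example rather than something the answer above requires.

```python
import pandas as pd

# Small subset of the question's data, enough to show the effect
data = [(-9.83334315, -5.92063135), (-5.53137301, -8.31010785)]
df = pd.DataFrame(data).T

# Default display: values are rounded to 6 decimal places when printed
narrow = df.to_string()

# Temporarily show 8 decimal places; the option is restored after the block
with pd.option_context("display.float_format", "{:.8f}".format):
    wide = df.to_string()

print(narrow)
print(wide)

# The stored value is unchanged in both cases
print(df.iloc[0, 0] == -9.83334315)
```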

Wrap the tuples in a list:
import pandas as pd
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df=pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
output:
A B C D
0 -9.833343 -5.920631 -7.832280 5.553141
1 -5.531373 -8.310108 -3.280625 -6.860671
2 -11.492390 -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879 -9.415762 -3.348587 -7.654747

Related

Extract values within the quotes signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract patterns but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
# Strip the quotes first so they do not end up in the new columns
for row in df['postcode'].str.replace("'", "").str.split(';'):
    p1.append(row[0])
    p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2
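Since the question mentions str.extract, here is a minimal sketch of that route as well. The regular expression and its named groups are this example's own choice, not taken from the answers above; each named group becomes a column in the result.

```python
import pandas as pd

df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'"],
                  columns=["postcode"])

# Capture the text between each pair of single quotes;
# the group names are used as the output column names
pattern = r"'(?P<postcode1>[^']*)';'(?P<postcode2>[^']*)'"
out = df["postcode"].str.extract(pattern)
print(out)
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed entries easy to spot.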

Querying a list object from API and returning it into dataframe - issues with format

I have the script below, which returns data as a list for each quote i. I set up an empty list, query the API function get_kline_data, and add each output to klines_list with the .extend function:
klines_list = []
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_list.extend([i,klines])
klines_list
klines_list then returns data in this format;
['REQ-ETH',
[['1619317500',
'0.0000491',
'0.0000491',
'0.0000491',
'0.0000491',
'5.1147',
'0.00025113177']],
'REQ-BTC',
[['1619317500',
'0.00000219',
'0.00000219',
'0.00000219',
'0.00000219',
'19.8044',
'0.000043371636']],
'XLM-BTC',
[['1619317500',
'0.00000863',
'0.00000861',
'0.00000863',
'0.00000861',
'653.5693',
'0.005629652673']]]
I then try to convert it into a dataframe;
import pandas as py
df = py.DataFrame(klines_list)
And this is the result;
0
0 REQ-ETH
1 [[1619317500, 0.0000491, 0.0000491, 0.0000491,...
2 REQ-BTC
3 [[1619317500, 0.00000219, 0.00000219, 0.000002...
4 XLM-BTC
5 [[1619317500, 0.00000863, 0.00000861, 0.000008..
The structure of the DF is incorrect and it seems to be due to the way I have put my list together.
I would like the quantitative data in a column corresponding to the correct entry in list a, not in rows. Also, the ticker data, or list a, ("REQ-ETH/REQ-BTC") etc should be in a separate column. What would be a good way to go about restructuring this?
Edit: @Ynjxsjmh
This is the output when following the suggestion below for appending a dictionary within the for loop
REQ-ETH REQ-BTC XLM-BTC
0 [1619317500, 0.0000491, 0.0000491, 0.0000491, ... NaN NaN
1 NaN [1619317500, 0.00000219, 0.00000219, 0.0000021... NaN
2 NaN NaN [1619317500, 0.00000863, 0.00000861, 0.0000086...
pandas.DataFrame() accepts a dict: each key becomes a column header and each value becomes that column's values.
import pandas as pd
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
klines_data = {}
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_data[i] = klines[0]  # add the ticker as a key in klines_data
df = pd.DataFrame(klines_data)
print(df)
REQ-ETH REQ-BTC XLM-BTC
0 1619317500 1619317500 1619317500
1 0.0000491 0.00000219 0.00000863
2 0.0000491 0.00000219 0.00000861
3 0.0000491 0.00000219 0.00000863
4 0.0000491 0.00000219 0.00000861
5 5.1147 19.8044 653.5693
6 0.00025113177 0.000043371636 0.005629652673
If the klines lists are not all the same length, you can use
df = pd.DataFrame.from_dict(klines_data, orient='index').T
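If the ticker should sit in its own column, as the question asks, one row per kline is another possible layout. The sketch below uses the sample responses from the question in place of the client.get_kline_data calls. The column names field_0 to field_6 are placeholders assumed for this example, since the question does not say what each of the seven values means.

```python
import pandas as pd

# Sample data copied from the question, standing in for the API responses
responses = {
    "REQ-ETH": [["1619317500", "0.0000491", "0.0000491", "0.0000491",
                 "0.0000491", "5.1147", "0.00025113177"]],
    "REQ-BTC": [["1619317500", "0.00000219", "0.00000219", "0.00000219",
                 "0.00000219", "19.8044", "0.000043371636"]],
    "XLM-BTC": [["1619317500", "0.00000863", "0.00000861", "0.00000863",
                 "0.00000861", "653.5693", "0.005629652673"]],
}

# Build one flat row per kline, prefixed with its ticker symbol
rows = []
for symbol, klines in responses.items():
    for kline in klines:
        rows.append([symbol, *kline])

# Placeholder column names (assumption): the source does not name the fields
value_columns = [f"field_{n}" for n in range(7)]
df = pd.DataFrame(rows, columns=["symbol"] + value_columns)

# The API returns strings, so convert the numeric fields explicitly
df = df.astype({column: float for column in value_columns})
print(df)
```

This layout also copes with requests that return more than one kline per ticker, because each kline simply adds another row.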

How to extract values from a list in Python and put into a dataframe

I have trained a model and have asked the model to produce the coefficients:
modelcoeffs = model.fit(X_train, y_train).coef_
coeffslist = list(modelcoeffs)
which yields, for example:
print(coeffslist):
[0.17005542 0.72965947 0.6833308 0.02509676]
I am trying to split out these 4 coefficients, but they don't seem to be individual elements.
Does anyone know how to split these into four numbers?
I am trying to get:
df['1'] = coeffslist[0]
df['2'] = coeffslist[1]
df['3'] = coeffslist[2]
df['4'] = coeffslist[3]
But it gives me NaN in the df. Does anyone have any ideas? thanks!
UPDATE
I am basically trying to get the coeffs to append to a df
print(df)
1 2 3 4
.... ..... ..... .....
0.17005542 0.72965947 0.6833308 0.02509676
This coeffslist doesn't look like a valid Python structure, it's missing commas.
But you might try this:
import pandas as pd
df = pd.DataFrame([0.17005542, 0.72965947, 0.6833308, 0.02509676])
print(df)
Output:
0
0 0.170055
1 0.729659
2 0.683331
3 0.025097
To get the coefficients as a row, try this:
import pandas as pd
df = pd.DataFrame(columns=list("1234"))
df.loc[len(df)] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
1 2 3 4
0 0.170055 0.729659 0.683331 0.025097
And if you want to add another row (append) of coefs, just do this:
df.loc[1] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
1 2 3 4
0 0.170055 0.729659 0.683331 0.025097
1 0.170055 0.729659 0.683331 0.025097
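You can convert [0.17005542 0.72965947 0.6833308 0.02509676] to a string, split it on whitespace, convert the pieces back to floats, and then append them to a dataframe.
# If coeffslist[0] is a NumPy array, str() includes the surrounding brackets, so strip them before splitting
str_list = str(coeffslist[0]).strip("[]")
float_list = [float(x) for x in str_list.split()]
df = pd.DataFrame(columns=['1','2','3','4'])
a_series = pd.Series(float_list, index=df.columns)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, a_series.to_frame().T], ignore_index=True)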
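The string round trip can be avoided by flattening the array directly. The sketch below assumes that coef_ is a NumPy array of shape (1, 4), which would explain why list(modelcoeffs) yields a single element holding all four numbers; the array here is a stand-in built from the values in the question, not output from a real model.

```python
import numpy as np
import pandas as pd

# Stand-in for model.fit(X_train, y_train).coef_ (shape assumed to be (1, 4))
modelcoeffs = np.array([[0.17005542, 0.72965947, 0.6833308, 0.02509676]])

# Flatten to a 1-D array of four individual numbers
flat = np.ravel(modelcoeffs)

# Put the coefficients into a one-row DataFrame with the requested column names
df = pd.DataFrame([flat], columns=["1", "2", "3", "4"])
print(df)
```

np.ravel also works if coef_ is already one-dimensional, so the same code covers both cases.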

I want to add the values of two cells present in the same column based on their " index = somevalue"

I have a data frame with the column "Key" as index like below:
Key Prediction
C11D0 0
C11D1 8
C12D0 1
C12D1 5
C13D0 3
C13D1 9
C14D0 4
C14D1 9
C15D0 5
C15D1 3
C1D0 5
C2D0 7
C3D0 4
C4D0 1
C4D1 9
I want to add the values of two cells in the Prediction column when their index values match in the first 4 characters, for example "C11D0" and "C11D1", or "C14D0" and "C14D1". The output would then be:
Operation Addition Result
C11D0+C11D1 8
C12D0+C12D1 6
C13D0+C13D1 12
You can use the isin function.
Example:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':[1,2,1,3,7,1]})
df[df.id.isin([1,5,6])].value.sum()
output:
9
for your case
idx = ['C11D0', 'C11D1']
print(df[df.Key.isin(idx)].Prediction.sum()) #outputs 8
First set Key as a column if it is the index:
df.reset_index(inplace=True)
Then you can use DataFrame.loc with boolean indexing:
df.loc[df['Key'].isin(["C11D0","C11D1"]),'Prediction'].sum()
You can also create a function for it:
def sum_select_df(key_list,df):
    # Keep the rows whose Key is C<key>D0 or C<key>D1 for each requested key, then sum them
    return pd.concat([df[df['Key'].isin(['C'+str(key)+'D1','C'+str(key)+'D0'])] for key in key_list])['Prediction'].sum()
sum_select_df([11,14],df)
Output:
21
Here is a complete solution, slightly different from the other answers so far. I tried to make it pretty self-explanatory, but let me know if you have any questions!
import numpy as np # only used to generate test data
import pandas as pd
import itertools as itt
start_inds = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
"C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
test_vals = np.random.randint(low=0, high=10, size=len(start_inds))
df = pd.DataFrame(data=test_vals, index=start_inds, columns=["prediction"])
ind_combs = itt.combinations(df.index.array, 2)
sum_records = ((f"{ind1}+{ind2}", df.loc[[ind1, ind2], "prediction"].sum())
for (ind1, ind2) in ind_combs if ind1[:4] == ind2[:4])
res_ind, res_vals = zip(*sum_records)
res_df = pd.DataFrame(data=res_vals, index=res_ind, columns=["sum_result"])
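The same pairing rule (index values that agree in their first 4 characters) can also be expressed as a groupby. This is a sketch under that reading of the question, using the Prediction values from the question rather than random test data; the names prefix, stats and pair_sums are illustrative.

```python
import pandas as pd

keys = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
        "C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
values = [0, 8, 1, 5, 3, 9, 4, 9, 5, 3, 5, 7, 4, 1, 9]
df = pd.DataFrame({"Prediction": values}, index=pd.Index(keys, name="Key"))

# Group rows by the first 4 characters of the index
prefix = df.index.str[:4]
stats = df.groupby(prefix)["Prediction"].agg(["sum", "count"])

# Keep only prefixes that actually form a pair
pair_sums = stats.loc[stats["count"] == 2, "sum"]
print(pair_sums)
```

Under this rule "C4D0" and "C4D1" are not paired, because their first 4 characters differ; grouping on a different key, such as the text before "D", would change that.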

Create a dataframe from a list with multiple columns

I want to create a dataframe from a list, the thing is that my column name is also in the list.
List:
['Input_file_column_name,Is_key,Config_file_column_name,Value\nEmployee ID,Y,identifierValue,identityTypeCode:001\nCumb ID,N,identifierValue,identityTypeCode:002\nFirst Name,N,first_Name \nLast Name,N,last_Name \nEmail,N,email_Address \nEntityID,N,entity_Id,entity_Id:01\nSourceCode,N,sourceCode,sourceCode:AHRWB\n']
Resulting dataframe:
Input_file_column_name Is_key Config_file_column_name Value
0 Employee ID Y identifierValue identityTypeCode:001
1 Cumb ID N identifierValue identityTypeCode:002
5 EntityID N entity_Id entity_Id:01
6 SourceCode N sourceCode sourceCode:AHRWB
How do I convert it? Do I convert the list to a dictionary and then do it or is there a way that it can be done directly?
Code:
import pandas as pd
with open('onboard_config.txt') as myFile:
    text = myFile.read()
result = text.split("regex")
print result
df=pd.DataFrame[[sub.split(",") for sub in result]]
It seems you need splitlines followed by Series.str.split (here l is your list):
df=pd.Series(l[0].splitlines()).str.split(',',expand=True).T.set_index(0).T.dropna()
df
Out[1183]:
0 Input_file_column_name ... Value
1 Employee ID ... identityTypeCode:001
2 Cumb ID ... identityTypeCode:002
6 EntityID ... entity_Id:01
7 SourceCode ... sourceCode:AHRWB
[4 rows x 4 columns]
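# lst is the list from the question (naming it "list" would shadow the built-in)
rows = [line.split(',') for line in lst[0].splitlines()]
# The first row holds the column names, the rest are the data
columns = rows[0]
pd.DataFrame(rows[1:], columns=columns)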
This will give you your desired df.
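Because the single string in the list is CSV text with a header line, it can also be handed to pandas.read_csv through io.StringIO. This is a minimal sketch, assuming the list is available as lst (a name chosen for this example); the string is the one from the question.

```python
import io

import pandas as pd

lst = [
    "Input_file_column_name,Is_key,Config_file_column_name,Value\n"
    "Employee ID,Y,identifierValue,identityTypeCode:001\n"
    "Cumb ID,N,identifierValue,identityTypeCode:002\n"
    "First Name,N,first_Name \n"
    "Last Name,N,last_Name \n"
    "Email,N,email_Address \n"
    "EntityID,N,entity_Id,entity_Id:01\n"
    "SourceCode,N,sourceCode,sourceCode:AHRWB\n"
]

# Treat the string as an in-memory file; the first line becomes the header
df = pd.read_csv(io.StringIO(lst[0]))

# Rows with a missing Value are filled with NaN; drop them to match the
# desired output in the question
complete = df.dropna()
print(complete)
```

If the data still lives in onboard_config.txt, passing that file name to read_csv directly would skip the intermediate list altogether.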
