Loading semi-structured data into pandas (Python)

I have data that looks like this (from jq)
script_runtime{application="app1",runtime="1651394161"} 1651394161
folder_put_time{application="app1",runtime="1651394161"} 22
folder_get_time{application="app1",runtime="1651394161"} 128.544
folder_ls_time{application="app1",runtime="1651394161"} 3.868
folder_ls_count{application="app1",runtime="1651394161"} 5046
The dataframe should allow manipulation of each row to this:
script_runtime,app1,1651394161,1651394161
folder_put_time,app1,1651394161,22
It's in a text file. How can I easily load it into pandas for data manipulation?

Load the .txt file using pd.read_csv(), specifying whitespace as the separator. The result will be a two-column dataframe with the bracketed text in the first column (0) and the numeric value in the second column (1).
import pandas as pd
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
Parse the bracketed text into separate columns:
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
The result is a dataframe with the columns 0, 1, function, application and runtime.
If you want to drop the first column, which contains the bracketed text:
df = df.iloc[: , 1:]
Full code:
import pandas as pd
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
df = df.iloc[: , 1:]
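As an alternative sketch, the three string splits above could be replaced by one regular expression with named groups passed to Series.str.extract. The sample lines are copied from the question; the pattern and the column name value are assumptions made for illustration, and the pattern assumes every line carries exactly the application and runtime labels in that order.

```python
import pandas as pd

# Sample lines copied from the question.
sample = '''script_runtime{application="app1",runtime="1651394161"} 1651394161
folder_put_time{application="app1",runtime="1651394161"} 22
folder_get_time{application="app1",runtime="1651394161"} 128.544
'''

# One named group per output column (the name "value" is an assumption).
pattern = (
    r'^(?P<function>\w+)'
    r'\{application="(?P<application>[^"]*)",runtime="(?P<runtime>[^"]*)"\}'
    r'\s+(?P<value>\S+)$'
)

# str.extract returns one dataframe column per named group.
df = pd.Series(sample.splitlines()).str.extract(pattern)

# The extracted pieces are strings; convert the measurement to a number.
df["value"] = pd.to_numeric(df["value"])
print(df)
```

For a real file, the list of lines could come from reading the file and calling splitlines() on its contents instead of using the inline sample.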

Related

Why are not all commas converted to decimal points when importing with pandas?

I am using the following code to import the CSV file. It works well except for when it encounters a three digit number followed by a decimal. Below is my code and the result
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def fft(x, Plot_ShareY=True):
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values='NaN') #loads the csv files
    #replaces non-numeric symbols to NaN.
    dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
    #print(dfs) #before dropping NaNs
    #each column taken into a separate variable
    time = dfs['Time'] #- np.min(dfs['Time'])
    channelA = dfs['Channel A']
    channelB = dfs['Channel B']
    channelC = dfs['Channel C']
    channelD = dfs['Channel D']
    channels = [channelA, channelB, channelC, channelD]
    #printing the smallest index number which is NaN
    ind_num_A = np.where(channelA.isna())[0][0]
    ind_num_B = np.where(channelB.isna())[0][0]
    ind_num_C = np.where(channelC.isna())[0][0]
    ind_num_D = np.where(channelD.isna())[0][0]
    ind_num = [ind_num_A, ind_num_B, ind_num_C, ind_num_D]
    #dropping all rows after the first NaN is found
    rem_ind = np.amin(ind_num) #finds the array-wise minimum
    #print('smallest index to be deleted is: ' +str(rem_ind))
    dfs = dfs.drop(dfs.index[rem_ind:])
    print(dfs) #after dropping NaNs
The result is as I want except for the last five rows in Channel B and C, where a comma is seen instead of a point to indicate decimal. I don't know why it works everywhere else but not for a few rows. The CSV file can be found here.
It looks like a data type issue. Some of the values in those columns are strings, so pandas does not convert the column to float and the ',' is never replaced with '.'.
One option is to convert each column after you read the file, with something like: df['colname'] = df['colname'].str.replace(',', '.').astype(float)
I think you need to treat the non-numeric symbols -∞ and ∞ as NaN while reading, not after the fact. If you replace them after the dataframe is created, the values have already been read in and the column has been parsed as data type str instead of float. This messes up the data types of the column.
So instead of na_values='NaN', use na_values=["-∞", "∞"]. The code then looks like this:
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values=["-∞", "∞"])
#replaces non-numeric symbols to NaN.
# dfs = dfs.replace({'-∞': np.nan, '∞': np.nan}) # not needed anymore
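The effect described in this answer can be checked on a small inline sample. This is a minimal sketch: the column names follow the question, but the three data rows are made up for illustration and are not taken from the asker's file.

```python
import io
import pandas as pd

# Made-up sample in the same style as the question's file:
# semicolon-separated, decimal comma, infinity symbols as placeholders.
csv_text = "Time;Channel A\n0,5;123,456\n1,0;-∞\n1,5;∞\n"

# Without na_values the infinity symbols keep the whole column as text,
# so the decimal comma in "123,456" is never converted.
without_na = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

# Declaring the symbols as missing values lets the column parse as float.
with_na = pd.read_csv(
    io.StringIO(csv_text), sep=";", decimal=",", na_values=["-∞", "∞"]
)

print(without_na["Channel A"].tolist())
print(with_na["Channel A"].tolist())
```

The first print shows the untouched strings, while the second shows a float followed by two NaN values.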

Create a dataframe from a string in Python

How do I create a dataframe from a string that looks like this (part of the string)?
,file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,
I am trying to make a dataframe that looks like this:
In the code below s is the string:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s)).dropna(axis=1)
df.rename(columns={df.columns[0]: ""}, inplace=True)
By the way, if the string comes from a csv file then it is simpler to read the file directly using pd.read_csv.
Edit: This code will create a multiindex of columns:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s), header = None).dropna(how="all", axis=1).T
df[0] = df.loc[1, 0]
df = df.set_index([0, 1]).T
It looks like you want a dataframe with multi-level columns from the string. Here's how I would do it.
Step 1: Split the string by '\r\n'. Then split each resulting line by ','.
Step 2: The above step creates a list of lists. Element #0 has 4 items and element #1 has 2 items. The remaining elements have 3 items each and hold the actual data.
Step 3: Convert the data into a dictionary from element #2 onwards. Use the values in element #1 as the keys of the dictionary (namely x data and y data). To build key:[list of values] pairs, use dict.setdefault(key, []).append(value).
Step 4: Create a normal dataframe from the dictionary, since all the values are now stored under their keys.
Step 5: Now that you have the dataframe, convert its columns to a MultiIndex.
Putting all this together, the code is:
import pandas as pd
text = ',file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,'
line_text = [txt.split(',') for txt in text.split('\r\n')]
dct = {}
for x,y,z in line_text[2:]:
    dct.setdefault(line_text[1][0], []).append(x)
    dct.setdefault(line_text[1][1], []).append(y)
df = pd.DataFrame(dct)
df.columns = pd.MultiIndex.from_tuples([(line_text[0][i],line_text[1][i]) for i in [0,1]])
print(df)
Output of this will be:
              file_05
   x data      y data
0  -970.0   -34.12164
1  -959.0   -32.37526
2  -949.0  -30.360199
3  -938.0   -28.74816
4  -929.0   -27.53912
5  -920.0   -25.92707
6  -911.0   -24.31503
7  -900.0   -23.64334
8  -891.0   -22.29997
You can convert your raw data to a table (a list of rows) in plain Python.
From there you can save it to a CSV file with the csv package, or load it into a DataFrame.
from pandas import DataFrame
# s is the raw data
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,"
# convert raw data to a table
table = [i.split(',') for i in s.split("\r\n")]
table = [i[:2] for i in table]
# table is like
"""
[['', 'file_05'],
['x data', 'y data'],
['-970.0', '-34.12164'],
['-959.0', '-32.37526'],
['-949.0', '-30.360199'],
...
['-891.0', '-22.29997']]
"""
# save to output.csv file
import csv
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)
# Save to DataFrame df (the values are still strings at this point)
df = DataFrame(table[2:], columns=table[1][:2])
print(df)
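If the multi-level header is not needed, a shorter route might be to let read_csv parse the string directly. The sketch below assumes that only the second header line (x data, y data) should be kept; it uses the first three data rows of the question's string. Passing index_col=False tells pandas not to treat the first column as an index when every data row ends with a trailing delimiter, which is the case here.

```python
from io import StringIO
import pandas as pd

# First part of the string from the question (three data rows).
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,"

# skiprows=1 drops the ",file_05,," line so "x data,y data" becomes the header.
# index_col=False keeps the first column as data despite the trailing commas.
df = pd.read_csv(StringIO(s), skiprows=1, index_col=False)

print(df)
print(df.dtypes)
```

Unlike the manual splitting approaches above, this yields numeric columns rather than strings.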

How can I read a double-semicolon-separated .csv with quoted values using pandas?

I analyse huge financial data-sets that often give me trouble because of corrupt data fields. Luckily, in the near future I get the opportunity to change the way data is delivered to me. The data will get delivered as a double-semicolon-separated txt-file with the fields in double quotation marks, i.e. "A";;"B";;"C"
When I use pandas' read_csv to convert this file to a dataframe, however, pandas doesn't seem to recognize the double quotation marks, only the double-semicolon separator: the output looks like "A" "B" "C" instead of A B C.
I've tried passing quotechar='"' as a parameter and quoting=csv.QUOTE_ALL, but that doesn't change anything.
import pandas as pd
import csv
def create_df(loc):
    df = pd.read_csv(loc, sep=';;', dtype=object, encoding="ISO-8859-1", quotechar='"', quoting=csv.QUOTE_ALL, header=None)
    return df
directory = 'C:\\PycharmProjects\\Test\\'
file = directory + 'test;;qq;;.txt'
df = create_df(file)
writer = pd.ExcelWriter('test.xlsx')
df.to_excel(writer, 'test')
writer.save()
This is a bug that shows up when pandas has to use the Python engine because the separator is not a single character. If you pass a single-character separator, it imports and parses the quoted columns correctly, but you end up with additional columns:
In[80]:
import csv
import io
import pandas as pd
t='''"A";;"B";;"C"'''
df = pd.read_csv(io.StringIO(t), sep=';', quoting=csv.QUOTE_ALL)
df
Out[80]:
Empty DataFrame
Columns: [A, Unnamed: 1, B, Unnamed: 3, C]
Index: []
then you can drop the extra columns by filtering:
In[81]:
df = df.loc[:,~df.columns.str.contains('Unnamed:')]
df
Out[81]:
Empty DataFrame
Columns: [A, B, C]
Index: []
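The same idea can be sketched for a file without a header row, as in the question's header=None call. The two sample rows below are made up for illustration; the second one contains ;; inside a quoted field to show that the quoting is respected. The sketch assumes every ;; produces exactly one empty column between two real ones, so the real data sits in every second column.

```python
import io
import pandas as pd

# Made-up sample: no header row, one field contains ";;" inside quotes.
t = '"A";;"B";;"C"\n"1";;"x;;y";;"3"\n'

# A single-character separator keeps the quote handling of the default parser.
raw = pd.read_csv(io.StringIO(t), sep=";", header=None, dtype=object, quotechar='"')

# Columns 1, 3, ... are the empty gaps between the doubled semicolons.
df = raw.iloc[:, ::2].copy()
df.columns = range(df.shape[1])

print(df)
```

Selecting by position avoids relying on the Unnamed: column names, which only exist when the first row is used as a header.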

How to read a CSV and avoid the index column?

The CSV data looks like this:
,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
and here is my program:
data = pd.read_csv('train.csv',delimiter=',')
group = data.drop('quality',axis=1).values
print(group[0])
I want the result to be 7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6, but it comes out as 0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8. So how do I avoid the index column?
The problem is that the values before the first , are not being used as the index, so you need index_col=[0]. The first column is then omitted when you call .values:
data = pd.read_csv('train.csv',delimiter=',', index_col=[0])
Or:
data = pd.read_csv('train.csv', index_col=[0])
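A small inline check of this behaviour, as a sketch: the sample keeps the unnamed first column and two rows from the question, but only three of the feature columns, to stay short.

```python
import io
import pandas as pd

# Shortened version of the question's data: the unnamed first column
# holds the row numbers written by an earlier to_csv call.
csv_text = (
    ",fixed acidity,volatile acidity,citric acid,quality\n"
    "0,7.0,0.27,0.36,6\n"
    "1,6.3,0.3,0.34,6\n"
)

# index_col=0 turns the first column into the index instead of a data column.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
group = data.drop("quality", axis=1).values

print(group[0])
```

The printed row starts at fixed acidity, so the row number no longer appears in .values.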

How to get rid of "changing" rows above headers (length changes every time but headers and data are always the same)

I have the following csv file:
csv file
There are about 6-8 rows at the top of the file before the headers. I know how to make a new dataframe in pandas and filter the data:
import pandas as pd
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"])*-.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. Look for the row that has date/time and another field. That way I do not have to change my code if the number of rows before the headers keeps changing.
2. Make a new dataframe excluding those rows.
What is the best approach to make sure the code does not break when the file changes, as long as the header row exists and has a few matching fields? I am open to any suggestions.
Consider a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header; skip_blank_lines=False keeps the row index
#equal to the line number in the file
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20, skip_blank_lines=False)
# use a regular expression to check which row holds the header
# the following will generate a Series of booleans
# with True if the row matches the regex "datetime.+settelment id.+type"
# (na=False treats blank lines as non-matching)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type", na=False)
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's line:
# skip every line before the header, so the header line becomes the column names
df = pd.read_csv("csv_file.csv", skiprows=header_index)
Reproducible example:
import pandas as pd
from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
# skip_blank_lines=False makes the row index match the line number,
# including the blank first line of st
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20, skip_blank_lines=False)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type", na=False)
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
  datetime   settelment id   type
0       dd              dd     dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
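The header search can also be sketched in plain Python before pandas reads anything. The helper name find_header_line and the choice of required fields are assumptions made for illustration; the sample text is the one from this answer. skipinitialspace=True is added here so that the spaces after the commas do not end up in the column names.

```python
import io
import pandas as pd

def find_header_line(text, required_fields):
    """Return the 0-based number of the first line containing every required field."""
    for line_number, line in enumerate(text.splitlines()):
        if all(field in line for field in required_fields):
            return line_number
    raise ValueError("no header line found")

text = (
    "random line content\n"
    "another random line\n"
    "yet another one\n"
    "datetime, settelment id, type\n"
    "dd, dd, dd\n"
)

# Locate the header, then skip exactly the lines that come before it.
header_line = find_header_line(text, ["datetime", "type"])
df = pd.read_csv(io.StringIO(text), skiprows=header_line, skipinitialspace=True)

print(df.columns.tolist())
```

For a file on disk, the text could be read once with open(...).read() and the same line number passed to read_csv as skiprows.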
