Store (df.info) method output in DataFrame or CSV - python

I have a giant DataFrame (df) whose dimensions are (42,--- x 135). I'm running df.info() on it, but the output is unreadable. I'm wondering if there is any way to dump it into a DataFrame or CSV. I think it has something to do with this part of the docs:
```
buf : writable buffer, defaults to sys.stdout
    Where to send the output. By default, the output is printed to sys.stdout.
    Pass a writable buffer if you need to further process the output.
```
But when I pass a buffer via buf=buffer, the output is just each word of the output followed by a newline, which is very hard to read and work with. My goal is to better understand which columns are in the DataFrame and to be able to sort them by type.

You need to open a file, then pass the file handle to df.info:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)

You could try avoiding pandas.DataFrame.info() and instead build the information you need as a pandas.DataFrame:
import pandas as pd

def get_info(df: pd.DataFrame):
    # One row per column: dtype, non-null count, unique values, sample values.
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.count()
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info
Then write it to CSV with get_info(df).to_csv('info_output.csv').
The memory usage information may also be useful, so you could do:
df.memory_usage().sum()
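If you also want per-column memory usage in the same table, a small extension of the get_info helper above could look like this (a sketch; deep=True inspects object columns' actual contents, which is slower but more accurate):
```
# Hypothetical extra column for get_info: memory usage in bytes per column.
# index=False keeps the result aligned with the per-column rows of `info`.
info['memory_usage_bytes'] = df.memory_usage(deep=True, index=False)
```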

import io
import pandas as pd

df = pd.read_csv('/content/house_price.csv')

# Capture df.info() output in an in-memory buffer instead of stdout.
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()

# Keep only the per-column table between the '-----' ruler and the dtypes summary.
with open("df_info.csv", "w", encoding="utf-8") as f:
    f.write(s.split(" ----- ")[1].split("dtypes")[0])

di = pd.read_csv('df_info.csv', sep=r"\s+", header=None)
di

Just to build on mechanical_meat's and Adam Safi's combined solution, the following code will convert the info output into a dataframe with no manual intervention:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)
info_output_df = pd.read_csv('info_output.txt', sep=r"\s+", header=None,
                             index_col=0, engine='python', skiprows=5, skipfooter=2)
Note that according to the docs, the 'skipfooter' option is only compatible with the python engine.
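Since the original goal was to sort the columns by type, you can then name the parsed columns and sort by dtype. A sketch, assuming the standard df.info() line layout of column name, non-null count, the literal 'non-null' token, and dtype:
```
# Assumed layout of the parsed table; adjust if your pandas version
# formats df.info() output differently.
info_output_df.columns = ['column', 'non_null_count', 'non_null_label', 'dtype']
print(info_output_df.sort_values('dtype'))
```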

Related

Convert a text file with a particular format into dataframe

I am new to Pandas, so I wanted to know if I can convert my text file with a particular format into a Pandas DataFrame. Below is my text file's format:
"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|
"ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0
I want to convert this format into a Pandas DataFrame with the same columns and values, separated by the | sign. That is, my DataFrame's columns will be FACT, FSYM, POSITION, and so on.
I am trying the code below, but it does not give me the desired output.
def convert_factset_file_to_dataframe(test_case_name, file_name):
    data = pd.read_csv(
        "{}/output/Float_Ingestion_files/{}/{}.txt".format(str(parentDir), test_case_name, file_name),
        sep=','
    )
    print(data)
It prints the following, just adding the index:
"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|
0 "ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0
Is there any other way of converting my text file format to a DataFrame besides reading it as a CSV? Or do I need to make some changes in the code?
You can use the argument sep (as stated in Thomas' comment).
data = pd.read_csv(filepath, sep="|")
For more information, see the documentation.
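One detail worth noting: because every line in the sample ends with a trailing |, pandas will see one extra, unnamed field. A minimal sketch (the inline sample is just the question's two lines) that reads the data and drops that column:
```
import io
import pandas as pd

# Sample reproduced from the question; note the trailing '|' on each line.
sample = (
    '"FACT"|"FSYM"|"POSITION"|"INDIRECT_OPTIONS"|"REPORT"|"SOURCE"|"COMMENTS"|\n'
    '"ABCX"|"VVG1"|2800000|760000|2022-11-03|"A"|"INCLUDES CAR"|0\n'
)

data = pd.read_csv(io.StringIO(sample), sep="|")
# The trailing delimiter produces a column named 'Unnamed: 7'; drop it if unwanted.
data = data.loc[:, ~data.columns.str.startswith('Unnamed')]
print(data)
```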
Conversely, to write a DataFrame back out with a chosen separator:
df.to_csv(file_name, sep='\t')
To use a specific encoding (e.g. 'utf-8'), use the encoding argument:
df.to_csv(file_name, sep='\t', encoding='utf-8')
I think you have a typo and should call:
data = pd.read_csv(
    "{}/output/Float_Ingestion_files/{}/{}.txt".format(
        str(parentDir), test_case_name, file_name
    ),
    sep="|",  # <<<<<<<<< don't choose the comma here, choose `|`
)
That is, just change the separator argument to the | sign.

Avoid pandas converting 0,1 to True and False

I am fairly new to pandas. I am reading a list of SQL files from a folder, writing the output to a text file using df.to_csv, and then using those files to upload to Redshift with the COPY command.
One issue I am having is that some of the boolean columns (1, 0) are being converted to True/False, which I do not want, as the Redshift COPY is throwing an error.
Here is my code:
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
I do not want to hard-code specific column names into the .astype(int) logic, as I am processing around 100 files with different output columns and different datatypes.
Also, df * 1 did not work, as it raised an error for datetime columns. Is there a solution for this? I am even okay with manipulating the output at df.to_csv.
I'm not sure if this is the most efficient solution, but you can check the type of each column and, if it's a boolean type, encode the labels using sklearn's LabelEncoder.
For example:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i, type_ in enumerate(df.dtypes):
    if type_ == 'bool':
        df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
Just add this code snippet within your for loop, right before saving it as csv.
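Note that LabelEncoder assigns codes in sorted class order, so False maps to 0 and True maps to 1 here, which matches the original integer values from the database.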
I found that this works. Gusto's answer made me realize I should play with iloc, and I came up with this solution:
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        for i, type_ in enumerate(df.dtypes):
            if type_ == 'bool':
                df = df.convert_dtypes(convert_boolean=False)
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
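A lighter-weight alternative that avoids both sklearn and repeated convert_dtypes calls (a sketch, not the original poster's code) is to cast only the bool columns to int just before writing:
```
# Select only the boolean columns and cast them to 0/1 integers;
# all other dtypes (including datetimes) are left untouched.
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)
df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
```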

How to fix data getting loaded into a single column of a pandas dataframe?

I have the following code:
import pandas as pd

file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset2 = pd.read_csv(file_path, header=None, dtype=str)
v = dataset2.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset1 = pd.DataFrame(f)
df = dataset1.astype('str')
dataset = df.values.tolist()
print(type(dataset))
print(type(dataset[1]))
print(type(dataset[1][1]))
The goal is to map each distinct value in the dataset to a value from 1..n, and afterwards to transform it into a list of lists where each element is a string.
The above code works great. However, when I change the dataset to:
file_path ='https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
I get an error. How can I make it work for this dataset as well?
You need to understand the data you're working with. A quick print call would've helped you realise that the delimiters in this one are different.
Furthermore, it appears to be numeric data; you don't need the str conversion anymore.
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
t = pd.read_csv(file_path, header=None, delim_whitespace=True)
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
df = pd.DataFrame(f)
If you want pandas to guess the delimiter, you can pass sep=None (pandas then needs the python engine to sniff the separator):
t = pd.read_csv(file_path, header=None, sep=None, engine='python')
I don't recommend this because it is very easy for pandas to make mistakes when loading your data with an inferred delimiter.
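To see what "a quick print call" means in practice, here is a standard-library sketch that peeks at the first few raw lines, so the whitespace delimiters become obvious:
```
import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
with urllib.request.urlopen(url) as resp:
    for _ in range(3):
        # Each raw line shows fields separated by runs of spaces, not commas.
        print(resp.readline().decode('ascii', errors='replace').rstrip())
```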

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
column 59, which in the header is star_name
column 60, which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are changed again in the future, as long as the names stay correct.
Until now I have tried various ways using the csv module and recently the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The key here was skipinitialspace, which removes the spaces after the delimiter in the header, so ' star_name' becomes 'star_name'.
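Since the asker's real worry is columns moving around, it is also worth knowing that usecols accepts a callable applied to each raw header name. A sketch (assuming a pandas version recent enough to support callable usecols) that selects by cleaned-up name regardless of column position:
```
import pandas as pd

wanted = {'star_name', 'ra'}
# The callable sees each raw header (possibly with leading spaces) and
# keeps the column if its stripped name is one we want.
df = pd.read_csv('data.csv', usecols=lambda name: name.strip() in wanted)
```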
According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols, which reads only the selected columns into the dataframe.
We use low_memory so that the file is internally processed in chunks.
Above answers are for python2. So for python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
Here is a solution to the above problem in a different way: read the entire csv file, but tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print(df[['star_name', 'ra']])
This could help in some scenarios when learning the basics and filtering data on the basis of dataframe columns.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])

How can I read only the header column of a CSV file using Python?

I am looking for a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available, for each csv file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems a gigantic waste to read in the entire file just to pull out the header line.
My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.
How can I extract only the header row of a CSV file, quickly?
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer Python 3 because of its Unicode handling. So this is very close to your original suggestion, except I'm only reading in one row rather than the whole file.
import csv

with open(fpath, 'r') as infile:
    reader = csv.DictReader(infile)
    # fieldnames reads just the header line; the rest of the file is untouched.
    fieldnames = reader.fieldnames
Hopefully that helps!
I've used iglob as an example to search for the .csv files, but one way is to collect the headers into a set, then adjust as necessary, e.g.:
import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, 'r', newline='') as fin:
        csvin = csv.reader(fin)
        # next() pulls only the header row; the [] default guards against empty files.
        unique_headers.update(next(csvin, []))
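Combining this with the nrows=0 trick from the pandas answer above, a pandas version of the same loop (a sketch) might be:
```
import pandas as pd
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    # nrows=0 parses only the header line, so even 500MB+ files stay cheap.
    unique_headers.update(pd.read_csv(filename, nrows=0).columns)
```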
Here's one way. You get 1 row.
In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')
In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]:
a b c d
0 0.365453 0.633631 -1.917368 -1.996505
What about:
pandas.read_csv(PATH_TO_CSV, nrows=1).columns
That'll read the first row only and return the columns found.
You missed the nrows=1 parameter to read_csv:
>>> df= pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and super fast: it reads the whole header as one string. You can then transform all the collected strings according to your needs:
for filename in glob.glob(files_path + "\*.csv"):
    with open(filename) as f:
        first_line = f.readline()
It is easy; you can use this:
df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()
This way you only read a very few rows to get your header.
If you are only interested in the headers and would like to use pandas, the only extra thing you need to pass in, apart from the csv file name, is nrows=0:
headers = pd.read_csv("test.csv", nrows=0)
import pandas as pd
get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)
