pandas incorrectly parsing csv - python

I've created a CSV file from a dataframe with a shape of (1081, 165233). I did this using this command: df2.to_csv("better_matched.csv")
However, whenever I try to load this csv as a pandas dataframe later on, the shape of the dataframe becomes (660, 165234). This is the code that I'm using to load the csv:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/Authorship/better_matched.csv")
df.shape
I think I need to change some parameters with .to_csv() or .read_csv() functions. What can be done?

Related

How to convert CSV to parquet file without RLE_DICTIONARY encoding?

I've already test three ways of converting a csv file to a parquet file. You can find them below. All the three created the parquet file. I've tried to view the contents of the parquet file using "APACHE PARQUET VIEWER" on Windows and I always got the following error message:
"encoding RLE_DICTIONARY is not supported"
Is there any way to avoid this?
Maybe a way to use another type of encoding?...
Below the code:
1º Using pandas:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet")
2º Using pyarrow:
from pyarrow import csv, parquet
table = csv.read_csv("filename.csv")
parquet.write_table(table, "filename.parquet")
3º Using dask:
from dask.dataframe import read_csv
dask_df = read_csv("filename.csv", dtype={'column_xpto': 'float64'})
dask_df.to_parquet("filename.parquet")
You should set use_dictionary to False:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet", use_dictionary=False)

Converte json file to csv file with proper formatted rows and columns in excel

Currently I'm working a script that can convert json file to csv format my script is working but I need to modify it to have proper data format like having rows and columns when the json file is converted to csv file, May I know what I need to add or modify on my script?
import pandas as pd
df = pd.read_json (r'/home/admin/myfile.json')
df.to_csv (r'/home/admin/xml/myfileSample.csv', index = None, sep=":")
Taking reference from your code,you can try
df.to_csv(r'/home/admin/xml/myfileSample.csv', encoding='utf-8', header=header,index = None, sep=":")
This could be useful.
import pandas as pd
df_json=pd.read_json("input_file.json")
df_json.head()
df_json.to_csv("output_file.csv",index=False)
Your code is all fine, Just change the to_csv to to_excel function and it should work all fine!
import pandas as pd
df = pd.read_json (r'/home/admin/myfile.json')
df.to_excel (r'/home/admin/xml/myfileSample.csv', index = None, sep=":")
Learn more about the to_excel function of pandas here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html

Python: Reading tdms files using Python npTDMS and creating a Pandas dataframe

I'm able to read a labview .tdms file using Python npTDMS package, could read metadata and sample data for groups and channels as well.
However, the file has timestamp values with year '9999'. Hence getting the following error while converting to a pandas dataframe:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp:.
I went through the documentation in:
https://nptdms.readthedocs.io/en/stable/apireference.html#nptdms.TdmsFile.as_dataframe; however, couldn't find an option to deal with this data situation.
Tried passing errors='coerce' while calling as.dataframe() didn't work either. Any pointers or directions to read the .tdms file to a pandas dataframe, with this data situation, would be very helpful.
Changing the data at the source is not an option.
Code snippet to read tdms file:
import numpy as np
import pandas as pd
from nptdms import TdmsFile as td
tdms_file = td.read(<tdms file name>)
tdms_file_df = tdms_file.as_dataframe()
Error while creating a pandas dataframe

Pycharm does not show dataframe

When I'm reading a csv file using the pandas library, but when I print the head of the dataframe it doesn't show it as a dataframe but more like a list.
This is the code I tipped:
import pandas as pd
df = pd.read_csv('Path/File')
print(df.head())
The output looks like this
How do I get it show the data frame properly?

way to generate a specified number dataframe of new csv file from existing csv file in python using pandas

I have large data-frame in a Csv file sample1 from that i have to generate a new Csv file contain only 100 data-frame.i have generate code for it.but i am getting key Error the label[100] is not in the index?
I have just tried as below,Any help would be appreciated
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")`
`
The correct syntax is with iloc:
data_frame.iloc[:100]
A more efficient way to do it is to use nrows argument who purpose is exactly to extract portions of files. This way you avoid wasting resources and time parsing useless rows:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=101) # 100+1 for header
data_frame.to_csv("C:/users/raju/sample.csv")

Categories

Resources