When converting Python Object to Dataframe, output is different - python

I am pulling data from an API, converting it into a Python object, and attempting to convert it to a DataFrame. However, when I unpack the Python object, I get more rows than the object actually contains.
You can see in my DataFrame that on 2023-02-03 there are multiple rows. One row seems to contain the correct data while the other contains seemingly random data. I am not sure where the extra row is coming from; I'm wondering if it has something to do with the null values or whether I am not unpacking the Python object correctly.
My code
I double checked the raw data from the JSON response and don't see the extra values there. On Oura's UI, I checked the raw data and didn't notice anything there either.
Here's an example of what my desired output would look like:
[screenshot of desired output]
Can anyone identify what I might be doing wrong?
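For context, a minimal sketch of how the flattening could be done with pd.json_normalize (the endpoint, parameters, and token below are placeholders, not my actual code):
import requests
import pandas as pd

# Placeholder request; adjust the endpoint, dates, and token to the real call.
resp = requests.get(
    "https://api.ouraring.com/v2/usercollection/daily_sleep",
    headers={"Authorization": "Bearer <token>"},
    params={"start_date": "2023-02-01", "end_date": "2023-02-05"},
)
records = resp.json()["data"]

# json_normalize flattens each nested record into exactly one row,
# which makes it easier to spot where any extra rows are coming from.
df = pd.json_normalize(records)
print(df.head())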

Related

I have a CSV generated from CICFLOWMETER and I'm unable to generate a correlation matrix: it either generates an empty data frame or raises an error.

This is the code I'm using. I have also tried converting the dtype of my columns from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but it's obviously not possible to do because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, then just open the file in Excel or a text editor and do the cleaning there. If it's a lot, and all the irrelevant data to be removed is in specific rows and/or columns, you could remove them from your DataFrame before calling df.corr instead of cleaning the file itself.
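As a rough sketch (using the file name from the question), you could keep only the numeric columns before computing the correlation:
import pandas as pd

df = pd.read_csv('DDOSping.csv')

# Drop non-numeric columns (flow IDs, IP pairs, etc.) so corr() only sees numbers
numeric_df = df.select_dtypes(include='number')
pearsoncorr = numeric_df.corr(method='pearson')
print(pearsoncorr)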

What is an appropriate choice to store very large files with python? .csv files truncated data in certain cells

I'm writing a Python script for the data acquisition phase of my project, and so far I've been storing data in .csv files. While I was reading data from a particular .csv file, I got the error:
SyntaxError: EOL while scanning string literal
I took a look at the specific row in the file, and the data in the specific cell were truncated. I am using pandas to store dicts to CSV and it never threw an error. I guess .csv will save itself no matter what, even if that means it will delete data without any warning.
I thought of changing to .xls. When the same row was being stored, an error came up saying (something along the lines of):
Max character length reached. Max character length per cell was ~32k.
Then I thought that it may just be an Excel/LibreOffice Calc issue (I tried both) and that they just can't display the data in the cell even though it is actually there. So I tried printing the specific cell; the data were indeed truncated. The specific cell contains a dict whose values are float, int, boolean or string. However, all of them have been converted to strings.
My question is, is there a way to fix it without changing the file format?
In the case that I have to change the file format, what would be an appropriate choice to store very large files? I am thinking about hdf5.
In case you need more info, do let me know. Thank you!
There is a limit to the field size:
csv.field_size_limit([new_limit])
Returns the current maximum field size allowed by the parser.
If new_limit is given, this becomes the new limit.
On my system (Python 3.8.0), I get:
>>> import csv
>>> csv.field_size_limit()
131072
which is exactly 128 KiB.
You could try to set the limit higher:
csv.field_size_limit(your_new_limit)
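For example (the 10 MB value is just a placeholder, pick whatever your data needs):
import csv

# current limit (131072 by default here)
print(csv.field_size_limit())

# raise it before parsing files with very large fields
csv.field_size_limit(10_000_000)
Note that this only affects parsers built on the csv module; as far as I know, pandas.read_csv only goes through it when you use engine='python'.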
But maybe a different file format would be better suited, depending on what kind of data you store.

How to get only the errors from the insert_rows_from_dataframe method in the BigQuery Client?

I am using client.insert_rows_from_dataframe method to insert data into my table.
obj = client.insert_rows_from_dataframe(table=TableRef, dataframe=df)
If there are no errors, obj will be a list of empty lists:
> print(obj)
[[], [], []]
But how do I get the error messages out if there are errors while inserting?
I tried
obj[["errors"]] ?
but that is not correct. Please help.
To achieve the results that you want, your DataFrame must have a header identical to the one in your table's schema. For example, if your schema in BigQuery has the fields index and name, your DataFrame should have these two columns.
Let's take a look at the example below:
I created a table in BigQuery named insert_from_dataframe which contains the fields index, name and number, respectively INTEGER, STRING and INTEGER, all of them REQUIRED.
In the first screenshot ("No errors raised") you can see that the insertion caused no errors; in the second ("Data inserted successfully"), you can see that the data was indeed inserted.
After that, I removed the value of the number column from the last row of the same data. When I tried to push it to BigQuery again, I got an error.
Given that, I would like to reinforce two points:
The error structure that is returned is a list of lists ([[], [], ...]). The reason is that your data is pushed in chunks (subsets of your data). In the function used, you can specify how many rows each chunk will have with the parameter chunk_size=<number_of_rows>. Let's suppose that your data has 1600 rows and your chunk size is 500: your data will be divided into 4 chunks. The object returned after the insert request will therefore consist of 4 lists inside a list, where each of the four lists is related to one chunk. It's also important to say that if a row fails the process, none of the rows inside the same chunk will be inserted into the table (see the sketch after the second point).
If you are using string fields, you should pay attention to the data being inserted. Sometimes pandas reads null values as empty strings, and this leads to a misinterpretation of the data by the insertion mechanism. In other words, it's possible to end up with empty strings inserted in your table when the expected result would be an error saying that the field cannot be null.
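A minimal sketch of how to flatten those per-chunk error lists and check them (the dataset, table, and DataFrame below are made up for illustration):
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table = client.get_table("my_dataset.insert_from_dataframe")  # made-up table ID

df = pd.DataFrame({"index": [1, 2], "name": ["a", "b"], "number": [10, 20]})

# One list of errors is returned per chunk; flatten them to inspect failures.
chunk_errors = client.insert_rows_from_dataframe(table=table, dataframe=df, chunk_size=500)
errors = [err for chunk in chunk_errors for err in chunk]
if errors:
    print("Some rows failed:", errors)
else:
    print("All rows inserted.")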
Finally, I would like to post here some useful links for this problem:
BigQuery client documentation
Working with missing values in Pandas
I hope it helps.

Pandas - Appending 'table' format to HDF5Store with different dtypes: invalid combinate of [values_axes]

I recently started trying to use the HDF5 format in Python pandas to store data, but I've encountered a problem I can't find a workaround for. Before, I worked with CSV files and had no trouble appending new data.
This is what I try:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table', append=True, data_columns=True, dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it's telling me I'm trying to append a different data type for a column, but what baffles me is that I previously wrote the same CSV file (along with some other CSV files) from a DataFrame to that HDF5 file.
I'm doing analysis in the forwarding industry, and the data there is very inconsistent: more often than not there are missing values, mixed dtypes in columns, or other 'data dirt'.
I'm looking for a way to append data to an HDF5 file no matter what is inside a column, as long as the column names are the same.
It would be great to be able to append data to the HDF store independent of datatypes, or to find another simple solution to my problem. The goal is to automate the analysis later on, so I'd rather not change datatypes every time I have a missing value in one of my 62 columns.
A second question within my question:
My file access with read_hdf takes more time than read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that file access with read_hdf should be faster.
I'm asking myself whether I should rather stick with CSV files or with HDF5.
Help would be greatly appreciated.
Okay, for anyone having the same issue with appending data where the dtype is not guaranteed to be the same: I finally found a solution. First convert every column to object with
li = list(frame)
frame[li] = frame[li].astype(object)
Check the result with frame.info(), then try df.to_hdf(path, key, append=True) and wait for its error message. The error message TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype will tell you which columns it still doesn't like. Convert the mentioned column with df['not_one_datatype'].astype(float); converting those columns to float worked for me. Only use integer if you are sure that a float will never occur in that column, otherwise the append method will fail again.
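A minimal sketch of that workflow (the input file name and the 'not_one_datatype' column are placeholders taken from the error message above):
import pandas as pd

frame = pd.read_csv('new_batch.csv')  # placeholder input file

# Cast everything to object first, then attempt the append.
li = list(frame)
frame[li] = frame[li].astype(object)

# If to_hdf complains "Cannot serialize the column [not_one_datatype] ...",
# convert that column explicitly and retry.
frame['not_one_datatype'] = frame['not_one_datatype'].astype(float)

frame.to_hdf('cdw.h5', key='cdw/data_cleaned', format='table',
             append=True, data_columns=True, dropna=False)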
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch to CSV; this is what I can personally recommend.
Update: Okay, it seems that the creators of this format did not have real-world data in mind when designing the HDF API: the HDF5 min_itemsize error (ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!) occurs when trying to append data to an already existing file if some value in a column happens to be longer than those in the initial write to the HDF file.
Now the joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the first write? Really? Another inconsistency is that df.to_hdf(append=True) does not offer the parameter min_itemsize={'column1':1000}. This format is at best suited for storing self-created data, but definitely not for data where the dtypes and the lengths of the entries in each column are not set in stone. The only solution left, when you want to append data from pandas DataFrames independent of the stubborn HDF5 API in Python, is to insert into every DataFrame, before appending, a row with very long strings (except for the numeric columns), just to be sure that you will always be able to append the data no matter how long it can possibly get.
When doing this, the write process will take ages and eat up gigantic amounts of disk space for the resulting huge HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration and especially usability.

Python Pandas Parser Error -- DtypeWarning

Every time I import this one CSV ('leads.csv') I get the following warning:
/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (11,12,13,14,17,19,20,21) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I import many CSVs for this one analysis, of which 'leads.csv' is only one, and it's the only file with the problem. When I look at those columns in a spreadsheet application, the values all look consistent.
For example, Column 11 (which is Column K when using Excel) is a simple Boolean field: every row is populated, and it's consistently populated with exactly 'FALSE' or exactly 'TRUE'. The other fields that this warning references have consistently formatted string values containing only letters and numbers. In most of these columns there are at least some blanks.
Anyway, given all of this, I don't understand why this message keeps happening. It doesn't seem to matter much as I can use the data anyway. But here are my questions:
1) How would you go about identifying the rows/records that are causing this warning? (A sketch of one approach follows below.)
2) Using the low_memory=False option seems to be pretty unpopular in many of the posts I read. Do I need to declare the datatype of each field in this case? Or should I just ignore the warning?
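For question 1, a sketch of one way to inspect what actually ended up in a flagged column (the 0-based column index and the file name are taken from the warning; the approach itself is just one option):
import pandas as pd

df = pd.read_csv('leads.csv')

# See which Python types ended up in one of the flagged columns (index 11)
col = df.columns[11]
print(df[col].apply(type).value_counts())

# Show the rows whose type differs from the most common one
most_common = df[col].apply(type).mode()[0]
print(df.loc[df[col].apply(type) != most_common, [col]])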
