I am trying to tidy up a large (8 GB) .csv file in Python and then stream it into BigQuery. My code below starts off okay: the table is created and the first 1000 rows go in, but then I get the error:
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Is this perhaps related to the streaming buffer? My issue is that I will need to remove the table before I run the code again, otherwise the first 1000 entries will be duplicated because of the 'append' method.
import pandas as pd

destination_table = 'product_data.FS_orders'
project_id = '##'
pkey = '##'

chunks = []
for chunk in pd.read_csv('Historic_orders.csv', chunksize=1000, encoding='windows-1252',
                         names=['Orderdate','Weborderno','Productcode','Quantitysold','Paymentmethod','ProductGender','DeviceType','Brand','ProductDescription','OrderType','ProductCategory','UnitpriceGBP' 'Webtype1','CostPrice','Webtype2','Webtype3','Variant','Orderlinetax']):
    chunk.replace(r' *!','Null', regex=True)
    chunk.to_gbq(destination_table, project_id, if_exists='append', private_key=pkey)
    chunks.append(chunk)

df = pd.concat(chunks, axis=0)
print(df.head(5))
pd.to_csv('Historic_orders_cleaned.csv')
Question:
- Why streaming and not simply loading? With a load job you can upload batches of 1 GB instead of 1000 rows. Streaming is usually for continuous data that needs to be appended as it arrives; if there is a break of a day between collecting the data and loading it, it's usually safer to just run a load job. See here.
Apart from that, I've had my share of issues loading tables into BigQuery from CSV files, and most of the time it was either 1) encoding (I see you have a non-UTF-8 encoding) or 2) invalid characters, e.g. a stray comma in the middle of the file that broke a line.
To validate that, what if you insert the rows backwards? Do you get the same error?
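If you do switch from streaming inserts to a load job, a rough sketch using the google-cloud-bigquery client (an assumption on my part, since your code uses pandas-gbq; the job configuration is something you would tune to your schema) could look like this:
from google.cloud import bigquery

client = bigquery.Client(project='##')  # assumes default application credentials are configured
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                     # or supply an explicit schema
    write_disposition='WRITE_TRUNCATE',  # replaces the table, so re-runs don't duplicate rows
)
with open('Historic_orders_cleaned.csv', 'rb') as f:
    load_job = client.load_table_from_file(f, 'product_data.FS_orders', job_config=job_config)
load_job.result()  # wait for the load job to finish
A load job also bypasses the streaming buffer entirely, so you no longer have to drop the table between runs.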
Here is the program I am developing in Python:
Step 1 - We will get a JSON file (the size could be in the GBs, e.g. 50 GB or more) from the source onto our server.
Step 2 - I load the JSON into a Pandas DataFrame using
df = pd.read_json(jsonfile, index=False, header=False)
Step 3 - I use df.to_csv(temp_csvfile, ..)
Step 4 - I use Python psycopg2 to create a PostgreSQL connection and cursor:
curr = conn.cursor()
Step 5 - Read the CSV and load it using copy_from:
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ..)
conn.commit()
I seek feedback on the points below:
a. Will loading the JSON into a Pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?
b. At Step 5 I am again opening the file in read mode; will the same issue arise there, since it might load the file into memory (am I missing anything here)?
c. Is there a better way of doing this?
d. Can Python Dask be used, since it supports reading data in chunks (I am not familiar with it)?
Please advise.
You could split your input JSON file into many smaller files, and also use the chunksize parameter while reading the file content into a pandas DataFrame. Also, use the psycopg2 copy_from function, which supports a buffer size parameter.
In fact, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
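As a rough sketch of that chunked approach (this assumes the JSON is newline-delimited, which pandas requires for chunked reads; the file name, table name, and connection string are placeholders, not taken from your code):
import io
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details
curr = conn.cursor()

# lines=True + chunksize streams newline-delimited JSON instead of loading it all at once
for chunk in pd.read_json("source.json", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, sep="\t", index=False, header=False)  # stage the chunk in memory as TSV
    buf.seek(0)
    curr.copy_from(buf, "target_table", size=8192)  # size is copy_from's read-buffer parameter
conn.commit()
conn.close()
If you prefer INSERTs over COPY, psycopg2.extras.execute_batch() works on the same chunks, though COPY is usually faster for bulk loads.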
References:
- Loading 20gb json file in pandas
- Loading dataframes data into postgresql table article
- Read a large json file into pandas
I have an h5 data file which includes the key rawreport.
I can read rawreport and save it as a dataframe using read_hdf(filename, "rawreport") without any problems. But the data has 17 million rows and I'd like to use chunking.
When I run this code:
chunksize = 10**6
someval = 100
df = pd.DataFrame()
for chunk in pd.read_hdf(filename, 'rawreport', chunksize=chunksize, where='datetime < someval'):
    df = pd.concat([df, chunk], ignore_index=True)
I get "TypeError: can only use an iterator or chunksize on a table"
What does it mean that rawreport isn't a table, and how can I overcome this issue? I'm not the person who created the h5 file.
Chunking is only possible if your file was written in table format using PyTables. This must be specified when the file is first written:
df.to_hdf(filename, key='rawreport', format='table')
If this wasn't specified when the file was written, then pandas defaults to the fixed format. That means the file can be written and read back quickly, but the entire dataframe must be read into memory at once. Unfortunately, it also means that chunking and the other read_hdf options for selecting particular rows or columns can't be used here.
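Since you said you can read the whole key once without problems, a possible workaround is to rewrite the key in table format once and then chunk from the new file; the file names here are placeholders:
import pandas as pd

df = pd.read_hdf("original.h5", "rawreport")   # one-time full load of the fixed-format key
df.to_hdf("rawreport_table.h5", key="rawreport", format="table",
          data_columns=["datetime"])           # data_columns enables where= filters on datetime

for chunk in pd.read_hdf("rawreport_table.h5", "rawreport", chunksize=10**6):
    pass  # process each chunk here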
I've successfully set up a Kafka instance at my job and I've been able to pass simple 'Hello World' messages through it.
However, I'm not sure how to do more interesting things. I've got a CSV containing four records from a DB that I'm trying to move through Kafka, then take into a DataFrame on the other side and save as a CSV again.
producer = KafkaProducer(bootstrap_servers='my-server-id:443',
    ....
df = pd.read_csv('data.csv')
df = df.to_json()
producer.send(mytopic, df.encode('utf8'))
This comes back in a tuple object (consumer record object, bool) that contains a list of my data. I can access the data as:
msg[0][0][6].decode('utf8')
But that comes in as a single string that I can't simply pass to a dataframe (it just merges everything into one thing).
I'm not sure if I even need a dataframe or a to_json() method or anything. I'm really just not sure how to organize the data to send it properly, then get it back and feed it into a dataframe so that I can either a) save it to a CSV or b) reinsert the dataframe into a DB with to_sql.
Kafka isn't really suited to sending entire matrices/dataframes around.
You can send a list of CSV rows, JSON arrays, or preferably some other compressible binary data format such as Avro or Protobuf as whole objects. If you are working exclusively in Python, you could pickle the data you send and receive.
When you read the data, you must deserialize it, but how you do that is ultimately your choice; there is no simple answer for any given application.
The solution, for this one case, would be json_normalize and then to_csv. I would also point out that Kafka isn't required for you to test that, as you should definitely be writing unit tests...
import json

df = pd.read_csv('data.csv')
jdf = df.to_json(orient='records')             # serialize the rows as a JSON array
msg_value = jdf                                # pretend you got a message from Kafka, as a JSON string
df = pd.json_normalize(json.loads(msg_value))  # parse the string, then back to a dataframe
df.to_csv()
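If you do want to exercise the Kafka path end to end, a rough sketch with kafka-python could look like the following (the server and topic names are copied from your snippet; sending one JSON message per row is just one reasonable choice, not the only one):
import json
import pandas as pd
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers='my-server-id:443',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
for row in pd.read_csv('data.csv').to_dict(orient='records'):
    producer.send('mytopic', row)   # one small JSON message per row
producer.flush()

consumer = KafkaConsumer(
    'mytopic',
    bootstrap_servers='my-server-id:443',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,       # stop iterating once no new messages arrive
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
df = pd.DataFrame([msg.value for msg in consumer])  # rebuild the dataframe row by row
df.to_csv('data_roundtrip.csv', index=False)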
I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and joining with other dataframes I have already loaded. As I want to be a little careful with memory, I decided to read it in chunks. For the purpose of the question, here is an extract of my CSV layout with dummy data (can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the file:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)
print("processing institution chunks")
chunk_list = []  # append each chunk df here
for chunk in inst_map:
    # perform data filtering
    chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
    chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)
ins_processed = pd.concat(chunk_list)
The zip_check function that I'm applying basically performs some datatype checks and then converts the value it gets into an integer.
Whenever I read the CSV it only ever reads the institution_id column and generates an index. The other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it just sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index, with only institution_id as the header, while only reading institution_name into the dataframe as a value.
I honestly have no clue what is going on here, and after 2 hours of SO / Google searching I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer works fine.
After re-uploading the file, it worked fine on the remote server as well.
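For anyone hitting something similar: a quick way to confirm that the transfer itself corrupted the file is to compare checksums of the local and remote copies, for example with a small helper like this run on both machines (the path is the one from the question):
import hashlib

def md5sum(path, blocksize=2**20):
    # hash the file in blocks so even a huge CSV never has to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

print(md5sum("data/hugefile.csv"))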
I have a very large JSON file (3M+ records) that causes memory issues when I try to import it using pandas:
data = pd.read_json("JSON_large_file.json")
This procedure causes problems when the file exceeds 100k records.
To get around this issue, I would like to open the file, search each record for an identifier (e.g. Customer_Number), and then append the matching records to a dataframe.
Is this possible, and how could it be done?
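A rough sketch of that approach, assuming the file is newline-delimited JSON (pandas requires lines=True for chunked reads) and that each record carries a Customer_Number field; the identifier value below is made up:
import pandas as pd

target = "CUST-12345"   # hypothetical Customer_Number to search for
matches = []

for chunk in pd.read_json("JSON_large_file.json", lines=True, chunksize=100_000):
    hits = chunk[chunk["Customer_Number"] == target]
    if not hits.empty:
        matches.append(hits)

result = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()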