Weird behavior of groupby using parquet - python

I have a dataframe df with df.columns=['ID','Month','Characteristic','Value'], and I want to know how many values there are for the subset ['ID','Month','Characteristic'], so I first created a new column df['Count']=1 and then applied:
db = df.groupby(['ID','Month','Characteristic'])['Count'].sum()
db = db.to_frame()
db = db.reset_index()
The weird thing is that if I load df from parquet using:
import pyarrow.parquet as pq
_table = (pq.ParquetFile(path)
.read(use_pandas_metadata=True))
df = _table.to_pandas(strings_to_categorical=True)
when I compute db it gives me a memory error because it creates all possible combinations: for example, even if ID1 doesn't have characteristic C1 in month M1, db contains a row like:
ID    Characteristic    Month    Count
ID1   C1                M1       0
I say it's weird because if I first save the parquet as a CSV and then load that CSV, it gives me the right result, with no zero-count rows in this case. Do you have any idea?

The issue you have comes from the use of categoricals.
Because you use strings_to_categorical=True when loading the data, the behaviour of the groupby changes to generate an entry for every possible ID/Characteristic/Month combination.
You can either stop using strings_to_categorical=True, which will however increase the memory usage of your program, or you can change your groupby to only show "observed" values:
table.to_pandas(strings_to_categorical=True).groupby(['ID','Month','Characteristic'], as_index=False, observed=True).size()
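To see the effect in isolation, here is a minimal sketch with made-up data (two IDs, two months, two characteristics) showing how grouping on categorical columns differs with and without observed=True:
import pandas as pd

df = pd.DataFrame({
    'ID': pd.Categorical(['ID1', 'ID2']),
    'Month': pd.Categorical(['M1', 'M2']),
    'Characteristic': pd.Categorical(['C1', 'C2']),
    'Value': [1, 2],
})

# Default behaviour with categoricals: every combination of the category
# levels appears, even those that never occur in the data (8 rows here).
print(df.groupby(['ID', 'Month', 'Characteristic']).size())

# observed=True keeps only the combinations actually present (2 rows).
print(df.groupby(['ID', 'Month', 'Characteristic'], observed=True).size())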

Related

Pyspark dataframe returns different results each time I run

Every time I run a simple groupby, pyspark returns different values, even though I haven't made any modification to the dataframe.
Here is the code I am using:
df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.show(df_check.count(),False)
I ran df_check.show() 3 times and the column paciente_id_count gives different values every time (I cut the tables so it would be easier to compare).
How do I prevent this?
The .show() does not compute the whole set of operations.
Maybe you could try the following (if the final number of rows fits in your driver memory):
df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.toPandas()
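If collecting to pandas is not an option, another thing worth trying (an assumption on my part, not something verified against this data) is to cache the aggregated DataFrame so that repeated actions reuse the same computed result instead of recomputing the aggregation each time:
from pyspark.sql import functions as F

df = spark.sql('select * from data ORDER BY document_id')
df_check = (df.groupBy('vacina_descricao_dose')
              .agg(F.count('paciente_id').alias('paciente_id_count'))
              .orderBy(F.desc('paciente_id_count'))
              .cache())            # materialize once; later actions reuse it
df_check.count()                   # force evaluation of the cached result
df_check.show(df_check.count(), truncate=False)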

python checksum or hash need same output every execution

Trying to create a unique key for a dataframe based on some columns. I used hashlib and zlib; both generate different values for the same dataframe record on each new execution of the Python program.
I am looking for a way to create a unique checksum that is the same for a given data record in the dataframe. There are many columns, so I don't want to use a concatenated column as the key. Any insights would be much appreciated. Sample code tested using hashlib and zlib is below.
Hashlib
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (hashlib.sha512(x).hexdigest().upper()))
zlib.adler32
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (zlib.adler32(x) & 0xffffffff ))
Edited (10/21): changed the code and hit a new problem. Please review. Sorry for any confusion.
The code snippets above have a problem: for a given row, some other row's column values were hashed into the 'Unique travelid' column, because pd.DataFrame() altered the original row order of the df. The modified code below fetches the respective column values for each row, but hits a new issue explained below.
Modified code
stg_matchdf["Unique travelid_Sum"] = stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1)
stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(lambda x: (zlib.adler32(str(x).encode('utf-8')) & 0xffffffff))
stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1) is not concatenating the columns in one particular order across multiple runs. Please see the sample below for two runs. The overall length is the same, but the order of concatenation is random, so hashlib or zlib return different values each time. Is there any way to specify the order of the columns in the code above?
Run1:
AHKGCANADACANADANORTH AMERICA266430RDirect WDAYYZINTERNATIONALMANULIFE - CANADA TRANSIENTFeb-2020HONG KONGASIA/PACIFICPARTIAL REFUND2020-02-15Canada266430.02020-02-02Hong Kong2020-03-01QVKGS6
Run2:
YYZCANADAPARTIAL REFUND2664302020-02-02AMANULIFE - CANADA TRANSIENTHONG KONGNORTH AMERICA2020-03-01Hong KongQVKGS6INTERNATIONALDirect WDRHKGACanadaFeb-2020266430.02020-02-15CANADAASIA/PACIFIC
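One way to remove the ambiguity (a sketch, assuming uniquecols_list is a plain Python collection of column names; sorting it is just one way to pin down a reproducible order) is to fix the column order explicitly and join with a separator before hashing:
import zlib

ordered_cols = sorted(uniquecols_list)                 # fixed, reproducible column order
concatenated = (stg_matchdf[ordered_cols].astype(str)
                .apply(lambda row: '|'.join(row), axis=1))   # separator avoids ambiguous joins
stg_matchdf['Unique travelid'] = concatenated.map(
    lambda s: zlib.adler32(s.encode('utf-8')) & 0xffffffff)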

pandas groupby is returning two groups for the same unique id

I have a large pandas dataframe, on which I am running groupby operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. scaf_9. But with very large data that contains scaf_9, I am getting two groups for CHROM value 2. There isn't really an error message and it is not affecting the computation; the issue is that when I write the data to a file by group, I get two groups for 2 (split unequally).
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Among all the rows with value 2, are most of them treated as integer objects and some (the later part, close to scaf_9) as string objects? Is this possible?
Sorry, I am only making assumptions here, and it is becoming impossible for me to know the origin of the problem.
Post Edit:
I have also tried running sort_values(['CHROM']) before the groupby, but the problem still persists.
Is there any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas processes each group separately.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
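A minimal sketch with made-up data showing why that helps: when the same key exists both as an int and as a string (or with stray whitespace), pandas sees two distinct group keys, and casting to str (plus str.strip) collapses them:
import pandas as pd

df = pd.DataFrame({'CHROM': [1, 1, 2, '2', 'scaf_9'],
                   'POS': [10, 20, 30, 40, 50]})

print(df.groupby('CHROM').size())      # int 2 and str '2' form two separate groups

df['CHROM'] = df['CHROM'].astype(str).str.strip()
print(df.groupby('CHROM').size())      # now a single group for '2'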

Pandas: Merge array is too big, large, how to merge in parts?

When trying to merge two dataframes using pandas I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8 GB of RAM (is this limited just by my RAM, or is it built into the pandas system?).
I know that once I have the merged table I will calculate a new column and then filter the rows, looking for the maximum values within groups. Therefore the final output table will be only 2.5 million rows.
How can I break this problem up so that I can execute this merge method on smaller parts and build up the output table, without hitting my RAM limitations?
The method below works correctly for this small data, but fails on the larger, real data:
import pandas as pd
import numpy as np
# Create input tables
t1 = {'scenario': [0, 0, 1, 1],
      'letter': ['a', 'b'] * 2,
      'number1': [10, 50, 20, 30]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [2, 5, 4, 7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)
# Merge the two, create the new column. This causes "...array is too big."
table3 = pd.merge(table1,table2,on='letter')
table3['calc'] = table3['number1']*table3['number2']
# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario','letter'])['calc'].idxmax()]
This is a follow up to two previous questions:
Does iterrows have performance issues?
What is a good way to avoid using iterrows in this example?
I answer my own Q below.
You can break up the first table using groupby (for instance, on 'scenario'). It could make sense to first make a new variable that gives you groups of exactly the size you want (see the sketch after the code below). Then iterate through these groups, doing the following for each: execute a new merge, filter, and then append the smaller result to your final output table.
As explained in "Does iterrows have performance issues?", iterating is slow, so try to use large groups to keep the work in pandas' most efficient vectorized methods. Pandas is relatively quick when it comes to merging.
Following on from where you create the input tables:
pieces = []
grouped = table1.groupby('scenario')
for _, group in grouped:
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # keep only the max-'calc' row per letter within this scenario group
    pieces.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
table3 = pd.concat(pieces, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
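The new variable that gives you groups of exactly the size you want could look something like this (a sketch: chunk_size and the chunk column name are made up, and a final reduction is needed because a scenario may now span two chunks):
import numpy as np

chunk_size = 100_000                                    # tune so each merge fits in RAM
table1['chunk'] = np.arange(len(table1)) // chunk_size

pieces = []
for _, group in table1.groupby('chunk'):
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    pieces.append(temp.loc[temp.groupby(['scenario', 'letter'])['calc'].idxmax()])

table3 = pd.concat(pieces, ignore_index=True)
# A scenario can be split across chunks, so reduce once more at the end.
table3 = table3.loc[table3.groupby(['scenario', 'letter'])['calc'].idxmax()]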

Get difference values between rows in Pandas DataFrame

Hi, I have a result set from psycopg2 like so:
(
(timestamp1, val11, val12, val13, val14),
(timestamp2, val21, val22, val23, val24),
(timestamp3, val31, val32, val33, val34),
(timestamp4, val41, val42, val43, val44),
)
I have to return the difference between the values of the rows (except for the timestamp column).
Each row should subtract the previous row's values.
The first row would be:
timestamp, 'NaN', 'NaN' ....
This then has to be returned as a generic object, i.e. something like an array of the following objects:
Group(timestamp=timestamp, rows=[val11, val12, val13, val14])
I was going to use Pandas to do the diff.
Something like the code below works OK on the values:
df = DataFrame().from_records(data=results, columns=headers)
diffs = df.set_index('time', drop=False).diff()
But diff also operates on the timestamp column, and I can't get it to ignore a column while leaving the original timestamp column in place.
Also, I wasn't sure it was going to be efficient to get the data into my return format, since Pandas advises against row access.
What would be a fast way to get the result set differences in my required output format?
Why did you set drop=False? That puts the timestamps in the index (where they will not be touched by diff) but also leaves a copy of the timestamps as a proper column, to be processed by diff.
I think this will do what you want:
diffs = df.set_index('time').diff().reset_index()
Since you mention psycopg2, take a look at the docs for pandas 0.14, released just a few days ago, which feature improved SQL functionality, including new support for PostgreSQL. You can read and write directly between the database and pandas DataFrames.
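To get the diffed frame into the requested Group objects without per-cell row access, something along these lines should be fast enough (a sketch; Group here is assumed to be a namedtuple-style container, as in the question, and results/headers come from the psycopg2 query):
from collections import namedtuple

import pandas as pd

Group = namedtuple('Group', ['timestamp', 'rows'])     # assumed container

df = pd.DataFrame.from_records(data=results, columns=headers)
diffs = df.set_index('time').diff().reset_index()

# itertuples avoids slow per-cell access; row[0] is the timestamp column.
groups = [Group(timestamp=row[0], rows=list(row[1:]))
          for row in diffs.itertuples(index=False)]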
