I read in two .FITS tables and place them into "list_a" and "list_b". "list_b" is a subset of "list_a", but has some additional columns, e.g. "age", that I'd like to add to my output. This is the current way I'm doing things:
file = open("MyFile.txt","w+")
for ii in range(100000):
name = list_B[np.where((list_A['NAME'][ii] == list_B['NAME']))]['NAME']
thing_from_b = list_B[np.where((list_A['NAME'][ii] == list_B['NAME']))]['AGE']
if (len(name) > 0) :
file.write(" {} {} \n".format(list_A['NAME'][ii], age )
file.close()
but it is so slow and clunky that I'm sure there must be a better, more pythonic method.
Assume "List_a" and "List_b" are both tables, and that you want to get the "ages" from "List_b" for entries where both "List_a" and "List_b", you can use Pandas as in your approach. But Astropy also has a built-in join operation for tables.
So I'm guessing you have something akin to:
>>> from astropy.table import Table
>>> tab_a = Table({'NAME': ['A', 'B', 'C']})
>>> tab_b = Table({'NAME': ['A', 'C', 'D'], 'AGE': [1, 3, 4]})
If you are reading from a FITS file you can use, for example Table.read to read a FITS table into a Table object (among other approaches).
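For instance, a minimal sketch of reading a table from a FITS file instead of constructing it by hand (the filename 'file_a.fits' is just a placeholder, and hdu=1 assumes the table lives in the first extension, as is typical for FITS binary tables):
>>> tab_a = Table.read('file_a.fits', format='fits', hdu=1)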
Then you can use join to join the two tables where their name is the same:
>>> from astropy.table import join
>>> tab_c = join(tab_a, tab_b, keys='NAME')
>>> tab_c
<Table length=2>
NAME AGE
str1 int64
---- -----
A 1
C 3
I think that may be what you're asking.
You could then write this out to an ASCII format (similar to your example) like:
>>> import sys
>>> tab_c.write(sys.stdout, format='ascii.no_header')
A 1
C 3
(Here you could replace sys.stdout with a filename; I was just using it for demonstration purposes). As you can see there are many built-in output formats for Tables, though you can also define your own.
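For instance, writing straight to a text file instead (the filename here is just a placeholder):
>>> tab_c.write('matched.txt', format='ascii.no_header')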
There are lots of goodies like this already in Astropy that should save you in many cases from reinventing the wheel when it comes to table manipulation and file format handling--just peruse the docs to get a better feel :)
It turns out that turning the lists into DataFrames and then doing a pandas merge works very well:
import numpy as np
import pandas as pd
from pandas import DataFrame
from astropy.table import Table

# Convert each FITS record array to an astropy Table, then to a DataFrame
list_a_table = Table(list_a)
list_a_df = DataFrame(np.array(list_a_table))
list_b_table = Table(list_b)
list_b_df = DataFrame(np.array(list_b_table))

# Merge on the shared name column
df_merge = pd.merge(list_a_df, list_b_df, on="NAME")
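If you still want the plain-text output from the original loop, the merged frame can be written out directly (a sketch; the output filename and the 'NAME'/'AGE' column names are taken from the question):
df_merge[['NAME', 'AGE']].to_csv('MyFile.txt', sep=' ', index=False, header=False)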
I am loading a Parquet file with Apache Arrow (pyarrow), and so far I necessarily need to transfer it to pandas, do the conversion to categorical there, and send it back as an Arrow table (to save it later as a Feather file).
The code looks like this:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as ft

df = pq.read_table(inputFile)
# convert to pandas
df2 = df.to_pandas()
# get all cols that need to be transformed and cast
list_str_obj_cols = df2.columns[df2.dtypes == "object"].tolist()
for str_obj_col in list_str_obj_cols:
    df2[str_obj_col] = df2[str_obj_col].astype("category")
print(df2.dtypes)
# get back from pandas to arrow
table = pa.Table.from_pandas(df2)
# write the file in fs
ft.write_feather(table, outputFile, compression='lz4')
Is there any way to do this directly with pyarrow? Would it be faster?
Thanks in advance.
In pyarrow "categorical" is referred to as "dictionary encoded". So I think your question is if it is possible to dictionary encode columns from an existing table. You can use the pyarrow.compute.dictionary_encode function to do this. Putting it all together:
import pyarrow as pa
import pyarrow.compute as pc

def dict_encode_all_str_columns(table):
    new_arrays = []
    for index, field in enumerate(table.schema):
        if field.type == pa.string():
            # Dictionary-encode string columns (the Arrow equivalent of pandas "category")
            new_arr = pc.dictionary_encode(table.column(index))
            new_arrays.append(new_arr)
        else:
            new_arrays.append(table.column(index))
    return pa.Table.from_arrays(new_arrays, names=table.column_names)

table = pa.Table.from_pydict({'int': [1, 2, 3, 4], 'str': ['x', 'y', 'x', 'y']})
print(table)
print(dict_encode_all_str_columns(table))
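To answer the original question directly, this lets you skip pandas entirely; a sketch of the full round trip, reusing the same placeholder file names as in the question:
import pyarrow.parquet as pq
import pyarrow.feather as ft

table = pq.read_table(inputFile)
table = dict_encode_all_str_columns(table)
ft.write_feather(table, outputFile, compression='lz4')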
I have defined 10 different DataFrames A06_df, A07_df, etc., which pick up six different data point inputs in a daily time series for a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc. for a few more formatting operations.
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i = i.fillna(0)
But this does not do the trick.
Any help is appreciated
As i.fillna() returns a new object (an updated copy of your original dataframe), i = i.fillna(0) will update the content of i but not the list contents A06_df, A07_df, ....
I suggest you copy the updated content in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i = i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k, i in dfs.items():
    i = i.fillna(0)
    # More code here
    dfs_updated[k] = i
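As a fuller sketch (assuming the same column layout and the col variable from your question), the whole formatting block then only has to be written once inside the loop:
dfs_updated = {}
for k, df in dfs.items():
    df = df.fillna(0)
    df[df < 0] = 0
    df.columns = col  # col is defined, as in the question
    for c in ['oil', 'water', 'gas']:
        df[c] = df[c] * 24
    df['water_inj'] = 0
    df['gas_inj'] = 0
    df = df[['oil', 'water', 'gas', 'gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
    dfs_updated[k] = df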
I am reading data from multiple dataframes.
Since the indexing and inputs are different, I need to repeat the pairing and analysis. I need dataframe-specific outputs. This pushes me to copy-paste and repeat the code.
Is there a fast way to refer to multiple dataframes to do the same analysis?
DF1 = pd.read_csv('DF1 Price.csv')
DF2 = pd.read_csv('DF2 Price.csv')
DF3 = pd.read_csv('DF3 Price.csv')  # These CSVs contain the main prices
DF1['ParentPrice'] = FamPrices['Price1']  # These contain the second prices
DF2['ParentPrice'] = FamPrices['Price2']
DF3['ParentPrice'] = FamPrices['Price3']
DF1['Difference'] = DF1['ParentPrice'] - DF1['Price']  # Price difference is the output
DF2['Difference'] = DF2['ParentPrice'] - DF2['Price']
DF3['Difference'] = DF3['ParentPrice'] - DF3['Price']
It is possible to parametrize strings using f-strings, available in Python >= 3.6. In an f-string you can insert the string representation of a variable's value inside the string, as in:
>>> a = 3
>>> s = f"{a} is larger than 1"
>>> print(s)
3 is larger than 1
Your code would become:
list_of_DF = []
for symbol in ["1", "2", "3"]:
    df = pd.read_csv(f"DF{symbol} Price.csv")
    df['ParentPrice'] = FamPrices[f'Price{symbol}']
    df['Difference'] = df['ParentPrice'] - df['Price']
    list_of_DF.append(df)
Then DF1 would be list_of_DF[0], and so on.
As I mentioned, this answer is only valid if you are using Python 3.6 or later.
For the third part I'd suggest creating something like:
DFS = [DF1, DF2, DF3]

def create_difference(dataframe):
    dataframe['Difference'] = dataframe['ParentPrice'] - dataframe['Price']

for dataframe in DFS:
    create_difference(dataframe)
For the second part there is no super-convenient, short way that I can think of, except maybe:
for i in range(len(DFS)):
    DFS[i]['ParentPrice'] = FamPrices[f'Price{i + 1}']  # DFS[0] holds DF1, so offset the index by 1
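Equivalently (just a stylistic sketch of the same loop), enumerate can supply the 1-based index directly:
for i, df in enumerate(DFS, start=1):
    df['ParentPrice'] = FamPrices[f'Price{i}']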
I have two FITS data files (file1.fits and file2.fits). The first one (file1.fits) consists of 80,700 important rows of data and the other one has 140,000 rows. Both of them have the same header.
$ python
>>> import pyfits
>>> f1 = pyfits.open('file1.fits')
>>> f2 = pyfits.open('file2.fits')
>>> event1 = f1[1].data
>>> event2 = f2[1].data
>>> len(event1)
80700
>>> len(event2)
140000
How can I combine file1.fits and file2.fits into a new FITS file (newfile.fits) with the same header as the old ones, so that the total number of rows of newfile.fits is 80,700 + 140,000 = 220,700?
I tried with astropy:
from astropy.table import Table, hstack
t1 = Table.read('file1.fits', format='fits')
t2 = Table.read('file2.fits', format='fits')
new = hstack([t1, t2])
new.write('combined.fits')
It seems to work with samples from NASA.
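Note that hstack stacks tables side by side (column-wise); to append the 140,000 rows below the 80,700 rows (which is what the 220,700 total implies), vstack is the row-wise counterpart. A minimal sketch (header metadata beyond the column definitions may need to be carried over separately):
from astropy.table import Table, vstack

t1 = Table.read('file1.fits', format='fits')
t2 = Table.read('file2.fits', format='fits')
new = vstack([t1, t2])  # row-wise concatenation: len(new) == len(t1) + len(t2)
new.write('newfile.fits', overwrite=True)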
I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of datastructure that would contain column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
As commented, there is a chunksize argument for read_sql, which means you can work on the sql results piecemeal. I would probably use an HDFStore to save the intermediary results... or you could just append it back to another sql table.
import pandas as pd

dfs = pd.read_sql(..., chunksize=100000)
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
store.close()
(see hdf5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
see the sql section of the docs.
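Once the cleaned chunks are stored, you can pull them back out in chunks as well for the analysis step; a minimal sketch, assuming the HDF5 route above and the "df" key used there:
import pandas as pd

# Read the cleaned data back from the HDF5 store chunk by chunk
with pd.HDFStore("store.h5") as store:
    for chunk in store.select("df", chunksize=100000):
        ...  # per-chunk analysis here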