Is it possible to manipulate a dataframe created through Pandas using SQL? - python

So I'm trying to create a python script that allows me to perform SQL manipulations on a dataframe (masterfile) I created using pandas. The dataframe draws its contents from the csv files found in a specific folder.
I was able to successfully create everything else, but I am having trouble with the SQL manipulation part. I am trying to use the dataframe as the "database" that my SQL query pulls from, but I am getting an "AttributeError: 'DataFrame' object has no attribute 'cursor'" error.
I'm not really seeing a lot of examples for pandas.read_sql_query(), so I am having a difficult time understanding how to use my dataframe with it.
import os
import glob
import pandas
os.chdir("SOMECENSOREDDIRECTORY")
all_csv = [i for i in glob.glob('*.{}'.format('csv')) if i != 'Masterfile.csv']
edited_files = []
for i in all_csv:
    df = pandas.read_csv(i)
    df["file_name"] = i.split('.')[0]
    edited_files.append(df)
masterfile = pandas.concat(edited_files, sort=False)
print("Data fields are as shown below:")
print(masterfile.iloc[0])
sql_query = "SELECT Country, file_name as Year, Happiness_Score FROM masterfile WHERE Country = 'Switzerland'"
output = pandas.read_sql_query(sql_query, masterfile)
output.to_csv('data_pull')
I know this part is wrong, but this is the concept I am trying to get to work but don't know how:
output = pandas.read_sql_query(sql_query, masterfile)
I appreciate any help I can get! I am a self-taught python programmer by the way, so I might be missing some general rule or something. Thanks!
Edit: replaced "slice" with "manipulate" because I realized I didn't want to just slice it. Also fixed some alignment issues on my code block.

It is possible to slice a dataframe created through pandas. You can use the loc function of pandas to slice a dataframe:
df.loc[rows, columns]
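That only slices, though. If you want actual SQL syntax against the dataframe, one approach (a sketch, not from the original answer) is to copy it into an in-memory SQLite database, since pandas.read_sql_query expects a database connection rather than a dataframe:
import sqlite3
import pandas

# read_sql_query needs a connection, not a dataframe, so load the
# dataframe into an in-memory SQLite database and query that instead
conn = sqlite3.connect(':memory:')
masterfile.to_sql('masterfile', conn, index=False)
output = pandas.read_sql_query(sql_query, conn)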

Related

Save dictionary as a pyspark Dataframe and load it - Python, Databricks

I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to rebuild it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then convert it back into a dictionary, but maybe there is an easier way than saving it as a dataframe, retrieving it as a dataframe, and converting it back into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code for realizing your needs, step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save a PySpark dataframe to a file, use the parquet format; the tfrecords format is not supported here.
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convert a PySpark dataframe to a dictionary.
my_dict2 = df2.toPandas().to_dict()
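Note that to_dict() defaults to nested dictionaries keyed by row index; to recover the original list-valued form, pass the 'list' orientation (a detail not in the original answer):
my_dict2 = df2.toPandas().to_dict('list')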
The computational cost of the code above depends on the memory usage of your actual dataset.

Read Excel sheet table (Listobject) into python with pandas

There are multiple ways to read Excel data into python.
Pandas also provides an API for writing and reading:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
That works fine.
BUT: What is the way to access the tables of every sheet directly into a pandas dataframe??
The picture (omitted here) shows a sheet including a table that does not start at cell (1,1).
Moreover the sheet might include several tables (listobjects in VBA).
I can not find anywhere the way to read them into pandas.
Note1: It is not possible to modify the workbook to bring all the tables to cell (1,1).
Note2: I would like to use just pandas (if possible) and minimize the need to import other libraries. But if there is no other way, I am ready to use another library. In any case, I could not manage with xlwings, for instance.
Here it looks like it's possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, great package for working with excel files in python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are, of course, list objects.
import xlwings
import pandas
with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library gives better visibility into the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
Then you are able to access the sheets within, each of which contains the collection of list objects (tables) in Excel.
Once you gain access to the tables, you can get the range addresses of those tables.
Finally, loop through the ranges and create a pandas dataframe from each.
This is a nicer solution, as it gives us the ability to loop through all the sheets and tables in a workbook.
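A minimal sketch of that approach (the file, sheet, and table names are whatever your workbook contains; openpyxl must be installed):
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('file.xlsx', data_only=True)
for ws in wb.worksheets:
    for table in ws.tables.values():
        # table.ref holds the range address of the list object, e.g. 'B3:I20'
        cells = ws[table.ref]
        rows = [[cell.value for cell in row] for row in cells]
        # the first row of an Excel table is its header row
        df = pd.DataFrame(rows[1:], columns=rows[0])
        print(ws.title, table.name)
        print(df)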
Here is a way to parse one table; however, it requires you to know some information about the sheet being parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if one table is present inside the sheet, but it is a first step:
import pandas as pd
import string

letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    for i, column in enumerate(df.columns):
        # 'is not None' so a valid index of 0 is not treated as missing
        if df[column].first_valid_index() is not None:
            return letter[i]

def get_last_column(df):
    # walk the columns in reverse so the first hit is the last column with data
    for i, column in reversed(list(enumerate(df.columns))):
        if df[column].first_valid_index() is not None:
            return letter[i]

def get_first_row(df):
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)

How do I execute this python code automatically in excel cells?

I need to extract the domain from a list of websites in an Excel sheet, for example http://www.example.com/example-page and http://test.com/test-page, and reduce each to its bare domain (example.com, test.com). I have the code part figured out, but I still need to get these commands to work on the Excel sheet cells in a column automatically.
here's_the_code
I think you should read the data in as a pandas DataFrame (pd.read_excel), make a function from your code, then apply it to the dataframe (df.apply). Then it is easy to save back to Excel with df.to_excel().
Of course, you will need pandas installed.
Something like:
import pandas as pd
dframe = pd.read_excel(io='' , sheet_name='')
dframe['domains'] = dframe['urls col name'].apply(your function)
dframe.to_excel('your path')
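For example, the mapping function could look like this (a sketch using the standard library's urllib.parse; the file and column names are hypothetical):
from urllib.parse import urlparse
import pandas as pd

def get_domain(url):
    # netloc is the host part of the URL, e.g. 'www.example.com'
    netloc = urlparse(url).netloc
    # strip a leading 'www.' if present
    return netloc[4:] if netloc.startswith('www.') else netloc

dframe = pd.read_excel('websites.xlsx')
dframe['domains'] = dframe['urls'].apply(get_domain)
dframe.to_excel('websites_with_domains.xlsx')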
Best

Convert dataframe column into integers from within loop

I'm trying to loop through a folder of csv's, put them into a dataframe, and change certain columns into integers before passing them through a Django model. Here is my code:
import glob
import pandas as pd
path = 'DIV1FCS_2017/*/*'
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    df['Number'].apply(pd.to_numeric)
I am receiving the following: ValueError: Unable to parse string
Does anybody know if I can convert a column of strings into integers using pd.to_numeric from within a loop? Outside of the loop it seems to work properly.
I think you probably have some non-numeric data stored in your dataframe, and that's what's causing the error.
You can examine your data and make sure everything is fine. In the meantime, you can also pass errors="ignore" to pd.to_numeric to skip the errors for now.
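A sketch of how to convert the column and locate the offending rows at the same time, using errors='coerce', which turns unparseable strings into NaN (path and column name taken from the question):
import glob
import pandas as pd

for fname in glob.glob('DIV1FCS_2017/*/*'):
    df = pd.read_csv(fname)
    # assign the result back; .apply() alone does not modify the column
    df['Number'] = pd.to_numeric(df['Number'], errors='coerce')
    # rows that failed to parse now show up as NaN
    print(fname)
    print(df[df['Number'].isna()])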

Hive UDF with Python

I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in Hive. Only instead of taking a csv as the input, doing the transformations, and then exporting another csv, I would like to take a Hive table as the input and then export the results as a new Hive table containing the transformed data.
Python Code:
import pandas as pd
data = pd.read_csv('Input.csv')
df = data
df = df.set_index(['Field1','Field2'])
Dummies=pd.get_dummies(df['Field3']).reset_index()
df2=Dummies.drop_duplicates()
df3=df2.groupby(['Field1','Field2']).sum()
df3.to_csv('Output.csv')
You can use the TRANSFORM clause to invoke a UDF written in Python. The detailed steps are outlined here and here.
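For reference, the Python side of a TRANSFORM script is a plain stdin/stdout filter. A minimal sketch (assuming tab-separated rows with Field1, Field2, Field3, since Hive streams rows to the script as tab-delimited text; the lower-casing is just a stand-in transformation):
import sys

for line in sys.stdin:
    # Hive sends one row per line, columns separated by tabs
    field1, field2, field3 = line.rstrip('\n').split('\t')
    # emit the transformed row back to Hive, tab-separated
    print('\t'.join([field1, field2, field3.lower()]))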
