PySpark join conditions using dictionary values for keys

I'm working on a script that tests the contents of some newly generated tables against production tables. The newly generated tables may or may not have the same column names and may have multiple columns that have to be used in join conditions. I'm attempting to write out a function with the needed keys being passed using a dictionary.
Something like this:
def check_subset_rel(self, remote_df, local_df, keys):
    join_conditions = []
    for key in keys:
        join_conditions.append(local_df.key['local_key'] == remote_df.key['remote_key'])
    missing_subset_df = local_df.join(remote_df, join_conditions, 'leftanti')
pyspark/python doesn't like the dictionary usage in local_df.key['local_key'] and remote_df.key['remote_key']; I get a "'DataFrame' object has no attribute 'key'" error. I'm pretty sure it's expecting the actual name of the column rather than a variable, but I'm not sure how to make that conversion between dictionary value and column name.
Does anyone know how I could go about this?
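A minimal sketch of one workaround, assuming each element of keys is a dict holding the local and remote column names: bracket indexing (df['colname']) accepts a column name stored in a variable, unlike df.colname attribute access.

def check_subset_rel(self, remote_df, local_df, keys):
    # keys is assumed to look like [{'local_key': 'id', 'remote_key': 'prod_id'}, ...]
    join_conditions = []
    for key in keys:
        # df[name] looks the column up by its string name, so dictionary values work here
        join_conditions.append(local_df[key['local_key']] == remote_df[key['remote_key']])
    return local_df.join(remote_df, join_conditions, 'leftanti')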

Related

How do I return the dataframe from a function in Python?

So here is my code:
def assign(x):
    x = dataDct['{}'.format(x)]
    x = x[x['_id'].str.contains("784561B7F90F")]
    return x
x is my dataframe name. I pass it to dataDct, which is a dictionary, as a key. This returns the dataframe when I don't use this function and just type the dataframe name manually. The dictionary is not declared inside the function; I don't know if that is causing a problem, and I'm unable to declare the dictionary as global. In the second line I query a certain column of x, and in the end I should get the dataframe x back.
So this is my original code:
motions=dataDct['motions']
motions=motions[motions['_id'].str.contains("784561B7F90F")]
motions
motions is the name of the dataframe. The dataframe is already stored in the dictionary with that name as the key; I just assigned it to a variable with the same name. Since I have to access 33 files like motions and was doing it manually, I decided to write the function instead. But it's not working and I'm getting a KeyError.
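A minimal sketch of a likely fix, assuming the KeyError comes from calling assign(motions) with the DataFrame variable rather than the string key 'motions': pass the key in as a string and look the frame up inside the function.

def assign(name):
    # name is the dictionary key as a string, e.g. 'motions'
    df = dataDct[name]
    return df[df['_id'].str.contains("784561B7F90F")]

motions = assign('motions')  # note the quotes: pass the key, not the variable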

Reading multiple CSV files with different names using python dictionary in a for loop

I have a list of filenames and filepaths that I have stored in a python dictionary. I am trying to read the files using pandas' read_csv and assign the dataframe names from the dictionary. I can read and print the dataframes while running the for loop but I cannot call the dataframes after the loop is finished. I could append all the dataframes in a list but that way I am not able to assign these dataframes different names which are also stored in the dictionary.
I checked various forums but none of them explain why the for loop with pd.read_csv doesn't work and why I am not able to assign the names to the dataframes in the for loop to later use them.
import pandas as pd

files_dict = {"Filter": "OUTTYPENM.csv", "previous": "previous.csv"}
for key, value in files_dict.items():
    key = pd.read_csv(value)
Filter.head()
I expect to see the first five lines of the Filter dataframe, as if I had read it in like this:
Filter = pd.read_csv("OUTTYPENM.csv")
All the csv files are in the current working directory.
When I run the for loop code and then run Filter.head(), I get an error: NameError: name 'Filter' is not defined
This doesn't exactly answer your question, but I think it gets you to a similar place, without involving any exec() or locals() calls.
Instead of creating variables named after your dictionary keys, you can just have a second dictionary where the keys are the same and the values are now the DFs you read in.
import pandas as pd

files_dict = {"Filter": "OUTTYPENM.csv", "previous": "previous.csv"}
df_dict = {}
for key, value in files_dict.items():
    df_dict[key] = pd.read_csv(value)

df_dict['Filter'].head()
Try this:
for key, value in files_dict.items():
    locals()[key] = pd.read_csv(value)
This method is not recommended, though: writing to locals() is only reliable at module scope, where locals() is globals(); inside a function the assignment has no effect. See the link here: https://www.linuxquestions.org/questions/programming-9/python-create-variables-from-dictionary-keys-859776/

Python/SQLite: smooth way to set column names in CREATE TABLE

I'm building up a table in Python's SQLite module which will consist of at least 18 columns. These columns are named by times (for example "08-00"), all stored in a list called time_range. I want to avoid writing out all 18 column names by hand in the SQL statement, since they already exist inside the mentioned list and spelling them out would make the code quite ugly. However, this:
marks = '?,'*18
self.c.execute('''CREATE TABLE finishedJobs (%s)''' % marks, tuple(time_range))
did not work. It seems Python/SQLite does not accept parameters in this position. Is there any smart workaround, or do I really have to name every single column in the CREATE TABLE statement by hand?
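A minimal sketch of the usual workaround: SQLite substitutes ? placeholders only for values, never for identifiers, so the column names have to be built into the SQL string itself. Quoting each name keeps hyphenated names like "08-00" legal; only do this with trusted input.

import sqlite3

time_range = ['08-00', '08-30', '09-00']  # shortened stand-in for the real 18-name list
# Build '"08-00", "08-30", ...' by quoting each identifier
columns = ', '.join('"{}"'.format(name) for name in time_range)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE finishedJobs ({})'.format(columns))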

How to update all object columns in SqlAlchemy?

I have a table of Users (more than 15 columns) and sometimes I need to completely update all the user attributes. For example, I want to replace
user_in_db = session.query(Users).filter_by(user_twitter_id=user.user_twitter_id).first()
with some other object.
I have found the following solution :
session.query(User).filter_by(id=123).update({"name": user.name})
but I feel that writing out all 15+ attributes is error-prone and there should be a simpler solution.
You can write:
session.query(User).filter_by(id=123).update(
    {column: getattr(user, column) for column in User.__table__.columns.keys()}
)
This will iterate over the columns of the User model (table) and it'll dynamically create a dictionary with the necessary keys and values.
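A hedged follow-up sketch: in practice you usually want to skip the primary key so the update doesn't try to overwrite it. This assumes the same User model and a primary key column named id.

values = {
    column: getattr(user, column)
    for column in User.__table__.columns.keys()
    if column != 'id'  # leave the primary key alone (assumed to be named 'id')
}
session.query(User).filter_by(id=123).update(values)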

group by in django

How can I create a simple GROUP BY query in the trunk version of Django?
I need something like
SELECT name
FROM mytable
GROUP BY name
Actually, what I want to do is simply get all entries with distinct names.
If you need all the distinct names, just do this:
Foo.objects.values('name').distinct()
And you'll get a list of dictionaries, each one with a name key. If you need other data, just add more attribute names as parameters to the .values() call. Of course, if you add in attributes that may vary between rows with the same name, you'll break the .distinct().
This won't help if you want to get complete model objects back. But getting distinct names and getting full data are inherently incompatible goals anyway; how do you know which row with a given name you want returned in its entirety? If you want to calculate some sort of aggregate data for all the rows with a given name, aggregation support was recently added to Django trunk and can take care of that for you.
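A minimal sketch of that aggregation route, assuming a Foo model with a name field: values() plus annotate() groups by name and computes one aggregate per distinct name.

from django.db.models import Count

# One row per distinct name, each annotated with how many rows share that name
Foo.objects.values('name').annotate(num_rows=Count('id'))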
Add .distinct() to your queryset:
Entries.objects.filter(something='xxx').distinct()
This will not work, because every row has a unique id, so every record is distinct.
To solve my problem I used
foo = Foo.objects.all()
foo.query.group_by = ['name']
but this is not an official API.
