How to link data records in pandas? - python

I have a csv file which I can read into a pandas data frame. The data is like:
+--------+---------+------+----------------+
| Name | Address | ID | Linked_To |
+--------+---------+------+----------------+
| Name 1 | ABC | 1233 | 1234;1235 |
| Name 2 | DEF | 1234 | 1233;1236;1237 |
| Name 3 | GHI | 1235 | 1234;1233;2589 |
+--------+---------+------+----------------+
How do I analyze the linkage between the ID and Linked_To columns? For example, should I be turning the Linked_To values into a list and doing a VLOOKUP-type analysis on the ID column? I know there must be an obvious way to do this, but I am stumped.
Ideally the end result would be a list or dictionary holding all of the attributes of a row, including those of the other records it's linked to.
Or is this a problem where I should be transforming the data into an SQL database?

For both the unique and non-unique ID cases, a dictionary of the Linked_To IDs for each ID could be obtained via:
def linked_ids(df):
    # set up the dictionary
    id_links = {}
    # iterate through the rows
    for row in df.index:
        # split the semicolon-delimited Linked_To field
        linked_to = df.loc[row, 'Linked_To'].split(";")
        # create an entry for this ID if one doesn't exist yet
        current_id = df.loc[row, 'ID']
        if current_id not in id_links:
            id_links[current_id] = []
        # record each linked ID that hasn't been seen for this ID
        for linked_id in linked_to:
            if linked_id not in id_links[current_id]:
                id_links[current_id].append(linked_id)
    return id_links
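For the sample data above, a quick check (a minimal sketch; the DataFrame construction here just mirrors the question's table) would look like:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Name 1', 'Name 2', 'Name 3'],
    'Address': ['ABC', 'DEF', 'GHI'],
    'ID': [1233, 1234, 1235],
    'Linked_To': ['1234;1235', '1233;1236;1237', '1234;1233;2589'],
})

print(linked_ids(df))
# {1233: ['1234', '1235'], 1234: ['1233', '1236', '1237'], 1235: ['1234', '1233', '2589']}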

If you are working with a pandas DataFrame, try this:
df.set_index('ID').Linked_To.str.split(';').to_dict()
Out[142]:
{1233: ['1234', '1235'],
1234: ['1233', '1236', '1237'],
1235: ['1234', '1233', '2589']}
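If you also want the full attributes of each linked record, as the question asks, one hedged follow-up (assuming a linked ID's attributes are only available when that ID appears in the ID column; IDs such as 1236 or 2589 with no row of their own are simply skipped) is:
# index the rows by ID so linked IDs can be looked up directly
by_id = df.set_index('ID')
links = by_id.Linked_To.str.split(';').to_dict()

# map each ID to the full rows (as dicts) of the records it links to
linked_records = {
    rec_id: [by_id.loc[int(l)].to_dict() for l in linked if int(l) in by_id.index]
    for rec_id, linked in links.items()
}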

Related

I have a date column in a pyspark dataframe whose header I want to change and from which I want to extract only the last 8 characters, while preserving row order

my dataframe looks like this:
| accountId | income | dateOfOrder             |
| 123       | 60000  | 56347264327_01_20200110 |
| 321       | 52000  | 54346262452_01_20200218 |
I want to take the header dateOfOrder, change it to acct_order_dt, and use only the last 8 characters, which are dates in yyyyMMdd format. I want to preserve the row order of this pyspark dataframe.
I am currently using this method, but I don't think it preserves the order:
sample_data = sample_data.withColumn("acct_order_dt", to_date(substring(col("dateOfOrder"),-8,8), "yyyyMMdd")).drop("dateOfOrder")
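For reference, a minimal self-contained sketch of that approach (the imports and sample rows below are assumptions mirroring the question's data); withColumn is a row-wise projection, so on its own it should not reorder the rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, to_date

spark = SparkSession.builder.getOrCreate()

sample_data = spark.createDataFrame(
    [(123, 60000, "56347264327_01_20200110"),
     (321, 52000, "54346262452_01_20200218")],
    ["accountId", "income", "dateOfOrder"],
)

# substring(..., -8, 8) takes the last 8 characters; to_date parses them as yyyyMMdd
sample_data = sample_data.withColumn(
    "acct_order_dt",
    to_date(substring(col("dateOfOrder"), -8, 8), "yyyyMMdd"),
).drop("dateOfOrder")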

Passing dataframe and using its name to create the csv file

I have a requirement where I need to pass different dataframes and write each dataframe's rows to a CSV file, and the name of the file needs to be the dataframe's name. Below is an example dataframe:
**Dataframe**
| Students | Mark | Grade |
| -------- | -----|------ |
| A | 90 | a |
| B | 60 | d |
| C | 40 | b |
| D | 45 | b |
| E | 66 | d |
| F | 80 | b |
| G | 70 | c |
A_Grade = df.loc[df['Grade'] == 'a']
B_Grade = df.loc[df['Grade'] == 'b']
C_Grade = df.loc[df['Grade'] == 'c']
D_Grade = df.loc[df['Grade'] == 'd']
E_Grade = df.loc[df['Grade'] == 'e']
F_Grade = df.loc[df['Grade'] == 'f']
Each of these dataframes (A_Grade, B_Grade, C_Grade, etc.) needs to be written to a separate file named A_Grade.csv, B_Grade.csv, C_Grade.csv, and so on.
I wanted to use a for loop and pass the dataframe to create the files, rather than writing separate lines for each, since the number of dataframes varies. The loop also sends a message via a Telegram bot. The code snippet I tried is below, but it didn't work. In short, the main thing is to dynamically create the CSV file with the dataframe's name.
for df in (A_Grade, B_Grade, C_Grade):
    if(len(df))
        dataframeitems.to_csv(f'C:\Documents\'+{df}+'{dt.date.today()}.csv',index=False)
        bot.send_message(chat_id=group_id,text='##{dfname.name} ##')
The solution given by @Ynjxsjmh works. Thanks @Ynjxsjmh. But I have another scenario where a function, as below, takes a dataframe as an argument, and the result of operations on that dataframe needs to be saved as a CSV with the dataframe's name.
def func(dataframe):
    ...
    ...
    ...
    dataframe2 = some actions and operations on dataframe
    result = dataframe2
    result.to_csv(params.datafilepath + f'ResultFolder\{dataframe}_{dt.date.today()}.csv', index=False)
The file needs to be saved with the dataframe's name as the filename, i.e. as dataframe.csv.
I could get the name using the code below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name

filename = get_df_name(dataframe)
print(filename)
Your f-string is malformed. You can use Series.unique() to get the unique values in a Series:
for grade in df['Grade'].unique():
    grade_df = df[df['Grade'].eq(grade)]
    grade_df.to_csv(f'C:\Documents\{grade.upper()}_Grade_{dt.date.today()}.csv', index=False)
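For the second scenario, where the saving happens inside a function, a variable generally does not know its own name, so the globals() lookup is fragile. A hedged alternative sketch (the function and folder names here are illustrative) is to pass the name in explicitly alongside the dataframe:
import datetime as dt
import pandas as pd

def save_result(dataframe: pd.DataFrame, name: str, folder: str = r'C:\Documents') -> None:
    # stand-in for the real operations performed on the dataframe
    result = dataframe
    result.to_csv(f'{folder}\\{name}_{dt.date.today()}.csv', index=False)

# the caller supplies the name explicitly, e.g.
# save_result(A_Grade, 'A_Grade')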

Add column from one dataframe to another WITHOUT JOIN

Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I am now hitting its limits with a huge number of tables and rows.
Let's say I have a dataframe of M features id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe containing all attributes of every feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector.
| id | salary | stat1_salary | stat2_salary | stat3_salary| age | stat1_age | stat2_age
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and is capped by the admin.
JOIN is expensive and resource-limited in pyspark, and I wonder if it's possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows for each feature table. I hope to have no join and no lookup, because my set of ids is the same.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the data for storage, and retrieval (if I want to query it back to append) does not guarantee the same order.
There doesn't seem to be a spark function to append a column from one DF to another directly except 'join'.
If you are starting from only one dataframe and trying to generate new features from each original column of that dataframe, I would suggest using a pandas_udf, where the new features can be appended inside the UDF for all the original columns.
This avoids using 'join' at all.
To control memory usage, choose the grouping column so that each group fits within the executor memory specification.
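A minimal sketch of that idea using grouped applyInPandas (the zone grouping column and the toy stats below are assumptions standing in for the real feature logic):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(301, 1000.0, "A"), (302, 2000.0, "A"), (303, 1500.0, "B")],
    ["id", "salary", "zone"],
)

def add_salary_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    # compute per-group features as extra columns on the same rows,
    # so no join is needed afterwards
    pdf["stat1_salary"] = pdf["salary"].mean()
    pdf["stat2_salary"] = pdf["salary"].rank()
    return pdf

schema = "id long, salary double, zone string, stat1_salary double, stat2_salary double"
features = df.groupBy("zone").applyInPandas(add_salary_stats, schema=schema)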

Use a CSV to create a SQL table

I have a csv file with the following format:
+--------------+--------+--------+--------+
| Description | brand1 | brand2 | brand3 |
+--------------+--------+--------+--------+
| afkjdjfkafj | 1 | 0 | 0 |
| fhajdfhjafh | 1 | 0 | 0 |
| afdkjfkajljf | 0 | 1 | 0 |
+--------------+--------+--------+--------+
I want to write a Python script that reads the CSV and creates a table in SQL. I want the table to have the description and the derived brand. If there is a 1 in a brand column of the CSV, then the description is associated with that brand. I then want to create a SQL table with the description and the associated brand name.
The table will be :
+-------------+---------------+
| Description | derived brand |
+-------------+---------------+
| afkjdjfkafj | brand 1 |
+-------------+---------------+
So far I have written the code for reading the CSV and turning the descriptions into a list:
df = pd.read_csv(SOURCE_FILE, delimiter=",")
descriptions = df['Description'].tolist()
Please provide some guidance on how to read the file and achieve this because I am so lost. Thanks!
I just answered a similar question on dba.stackexchange.com, but here are the basics.
Create your table...
create table myStagingTable (Description varchar(64), Brand1 bit, Brand2 bit, Brand3 bit)
Then, bulk insert into it, ignoring the first row if your first row has column headers.
bulk insert myStagingTable
from 'C:\somefile.csv'
with( firstrow = 2,
      fieldterminator = ',',
      rowterminator = '\n')
Now your data will be in a table just like it is in your CSV file. To insert it into your final table, you can use IIF and COALESCE:
insert into finalTable
select distinct
[Description]
,DerivedBrand = coalesce(iif(Brand1 = 1,'Brand1',null),iif(Brand2 = 1,'Brand2',null),iif(Brand3 = 1,'Brand3',null))
from myStagingTable
See a DEMO HERE
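Since the question asks for a Python script, here is a rough pandas-based sketch of the same derivation (the SQLite target and table name are purely illustrative assumptions; idxmax picks, for each row, the brand column holding the 1):
import pandas as pd
import sqlite3

df = pd.read_csv(SOURCE_FILE, delimiter=",")

# every column other than Description is a 0/1 brand flag column
brand_cols = [c for c in df.columns if c != "Description"]
derived = pd.DataFrame({
    "Description": df["Description"],
    "derived_brand": df[brand_cols].idxmax(axis=1),
})

# write the result out as a SQL table (SQLite used here only for illustration)
with sqlite3.connect("brands.db") as conn:
    derived.to_sql("derived_brands", conn, if_exists="replace", index=False)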

Retrieve column name of last month of transactions in Pandas

Let's say I have a dataframe formatted the following way:
id | name | 052017 | 062017 | 072017 | 092017 | 102017
20 | abcd | 0 | 100 | 200 | 50 | 0
I need to retrieve the column name of the last month an organization had any transactions. In this case, I would like to add a column called "date_string" that would have 092017 as its contents.
Any way to achieve this?
Thanks!
Replace 0 with np.nan, then use last_valid_index:
df.replace(0, np.nan).apply(lambda x: x.last_valid_index(), 1)
Out[602]:
0 092017
dtype: object
#df['newcol'] = df.replace(0,np.nan).apply(lambda x :x.last_valid_index(),1)
