Pandas dataframe processing

So I am back with another question about python and pandas.
I have table1 with the following columns:
ID;COUNT;FOREIGN_ID;OTHER_DATA
1;3;xyz1
2;1;xyz2
3;1;xyz3
table2
ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000
Both tables are stored as CSV files. I load both of them into dataframes and then iterate through table1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.
import pandas as pd
import numpy as np

df_result = pd.DataFrame()
df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')

for index, row in df_table1.iterrows():
    # all table2 rows with the same FOREIGN_ID as the current table1 row
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    # pick COUNT of them at random, without replacement
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    df_result = df_result.append(df_candidates.iloc[random_numbers])
In my earlier question I got an answer that using a for loop is a big time waster, but for this problem I can't find a solution that doesn't need one.
EDIT:
I am sorry for editing my question so late; I was busy with other stuff.
As requested, below is how the result table is built. Please note that my real tables are slightly different from those below: in my real use case I am joining on 3 foreign keys, but for demonstration I am using tables with fake data.
So the logic should be something like this:
Read the first line of table1.
1;3;xyz1
Find all records with the same FOREIGN_ID in table2.
count = 3, foreign_id = xyz1
The rows with foreign_id = xyz1 are:
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
Because count = 3 I must randomly choose 3 of those records.
I do this with the following line (df_candidates is the table of all suitable records, i.e. the rows listed above):
random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
Then I store the randomly chosen records in df_result. After processing all rows from table1, I write df_result to a CSV.
The problem is that my tables are 0.5-1 million rows, so iterating through every row in table1 is really slow. I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.

To select only the table2 rows whose FOREIGN_ID values appear in table1, you can use, for example, pd.merge:
col = "FOREIGN_ID"
left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")
Or df.isin():
ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]
Then, to select a random sample per group:
def select_random_row(grp):
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]

filtered.groupby(col).apply(select_random_row)
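Note that select_random_row draws only one row per group, while the question needs COUNT rows per FOREIGN_ID. A minimal sketch of one way to extend this, assuming the COUNT column from the sample data and that each FOREIGN_ID appears only once in df_table1:

counts = df_table1.set_index(col)['COUNT']

def select_random_rows(grp):
    # grp.name is the FOREIGN_ID of this group; draw COUNT rows without replacement
    return grp.sample(n=int(counts[grp.name]), replace=False)

df_result = filtered.groupby(col, group_keys=False).apply(select_random_rows)

DataFrame.sample does the random drawing per group, so no explicit loop over table1 is needed.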

Have you looked into using pd.merge()?
Your call would look something like:
results=pd.merge(table1, table2, how='inner', on='FOREIGN_ID')
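The merge alone does not do the random selection, though. A rough sketch of one way to continue from it, assuming the column names from the question's sample data (table1 and table2 here being the loaded dataframes, df_table1 and df_table2 in the question) and that each FOREIGN_ID appears only once in table1:

# carry COUNT through the merge, then draw that many random rows per FOREIGN_ID
results = pd.merge(table1[['FOREIGN_ID', 'COUNT']], table2, how='inner', on='FOREIGN_ID')
sampled = (
    results.groupby('FOREIGN_ID', group_keys=False)
           .apply(lambda g: g.sample(n=int(g['COUNT'].iloc[0])))
           .drop(columns='COUNT')
)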


SQLite AUTO_INCREMENT id field appears upside down in the program

Here is a picture for a better understanding: https://i.stack.imgur.com/S6tpl.png
def consult(self):
    book = self.cuadro_blanco_cliente.get_children()
    for elementos in book:
        self.cuadro_blanco_cliente.delete(elementos)
    query = "SELECT Nro, codigo, nombre, nfc, telefono, celular, direccion FROM clientes"
    rows = self.run_query(query)
    for row in rows:
        self.cuadro_blanco_cliente.insert('', 0, text=row[1], values=row)
The problem isn't in the id field, it's in the way you add the rows to the display. You are traversing the fetched rows from id 1 to n, but always inserting at the beginning, which makes it look like the ids go from n to 1.
Try adding this at the end of your query:
"... ORDER BY id DESC"
This way you insert the last element first, then insert each earlier row before it, and so on, ensuring the displayed rows end up ordered by id.
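For example, a sketch of how the query in consult() might look with this change, assuming Nro is the id column to order by (it is the first column in the posted SELECT):

query = ("SELECT Nro, codigo, nombre, nfc, telefono, celular, direccion "
         "FROM clientes ORDER BY Nro DESC")
rows = self.run_query(query)
for row in rows:
    # inserting each row at index 0 reverses the DESC order back to ascending
    self.cuadro_blanco_cliente.insert('', 0, text=row[1], values=row)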
I added some lines to the code and fixed the problem; it begins from 1 now:
for row in rows:
    id = row[0]
    self.cuadro_blanco_cliente.insert("", END, id, text=id, values=row)

Remove rows from a table but renumber the index to match the number of remaining items

I made a script for a table, but sometimes the table maker inserts some things that aren't values. I wanted to do something automatic, without having to go into the tables and remove those rows by hand. I managed to remove these rows, but their old index remains, and I need the index to be consecutive (0, 1, 2, 3, 4, 5, 6...).
For example, in the table I'm currently using, the row with index 0 has to be removed, and when I build the new table its index starts at 1.
for i in range(table.shape[0]):
    if pd.isna(table[1][i]):
        table = table.drop(labels=i, axis=0)

table2 = pd.DataFrame(table)
Following Keith Johnson's answer, I did some research and managed to find the solution:
table2.reset_index(inplace=True, drop=True)
You can reset the index
table2 = table2.reset_index(drop=True)
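If the goal is just to drop the non-value rows and renumber, the loop and the reset can also be combined into one step; a minimal sketch, assuming column 1 is the column being checked as in the question's loop:

# keep only rows where column 1 is not NaN, then renumber the index
table2 = table[table[1].notna()].reset_index(drop=True)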

Extract specific columns and group them from dictionaries in Python

I want to extract specific columns and group them from the records which I get using MySQLdb. I have written the following code:
import _mysql

cdb = _mysql.connect(host="myhost", user="root",
                     passwd="******", db="my_db")
qry = "select col1,col2,col3,col4,col5,col6 from mytable"
cdb.query(qry)
resultset = cdb.store_result()
records = resultset.fetch_row(0, 1)  # 0 - no limit, 1 - output rows as dictionaries
I want to extract only 3 columns from the records: col1, col3, and col4, and I want to group the unique values of these three columns, i.e. all unique combinations of (col1, col3, col4). I know I have to use the set() datatype to find unique values; I tried to use it but didn't have any success. Let me know what a good solution would be.
I have thousands of records in the database. I am getting the records in the following form:
({
'col1':'data11',
'col2':'data11',
'col3':'data13',
'col4':'data14',
'col5':'data15',
'col6':'data16'
},
{
'col1':'data21',
'col2':'data21',
'col3':'data23',
'col4':'data24',
'col5':'data25',
'col6':'data26'
})
I have come up with this solution:
def filter_unique(records, columns):
    unique = set(tuple(rec[col] for col in columns) for rec in records)
    return [dict(zip(columns, items)) for items in unique]
It first generates a tuple of column values for each record, then removes duplicates with set(), then reconstructs a dictionary by giving names to the values in each tuple.
Call it like this:
filtered_records = filter_unique(records, ['col1', 'col3', 'col4'])
Disclaimer: I am a python beginner myself, so my solution might not be the best or the most optimized one.
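For the sample records above, filtering on ['col1', 'col3', 'col4'] would give a list like the following (set iteration order is arbitrary, so the two dicts may come out in either order):

[{'col1': 'data11', 'col3': 'data13', 'col4': 'data14'},
 {'col1': 'data21', 'col3': 'data23', 'col4': 'data24'}]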

Pandas: merge fails with "array is too big"; how to merge in parts?

When trying to merge two dataframes using pandas I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8GB of RAM (is this limited just by my RAM or is it built into the pandas system?).
I know that once I have the merged table I will calculate a new column and then filter the rows, looking for the maximum values within groups. Therefore the final output table will be only 2.5 million rows.
How can I break this problem up so that I can execute this merge method on smaller parts and build up the output table, without hitting my RAM limitations?
The method below works correctly for this small data, but fails on the larger, real data:
import pandas as pd
import numpy as np

# Create input tables
t1 = {'scenario': [0, 0, 1, 1],
      'letter': ['a', 'b'] * 2,
      'number1': [10, 50, 20, 30]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [2, 5, 4, 7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge the two, create the new column. This causes "...array is too big." on the real data
table3 = pd.merge(table1, table2, on='letter')
table3['calc'] = table3['number1'] * table3['number2']

# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario', 'letter'])['calc'].idxmax()]
This is a follow up to two previous questions:
Does iterrows have performance issues?
What is a good way to avoid using iterrows in this example?
I answer my own question below.
You can break up the first table using groupby (for instance, on 'scenario'). It could make sense to first create a new variable that gives you groups of exactly the size you want. Then iterate through these groups, doing the following on each: run the merge, filter, and then append the smaller result to your final output table.
As explained in "Does iterrows have performance issues?", iterating is slow. Therefore try to use large groups so that pandas keeps using its most efficient vectorized methods; pandas is relatively quick when it comes to merging.
Following on from where you create the input tables:
table3 = pd.DataFrame()
grouped = table1.groupby('scenario')
for _, group in grouped:
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    table3 = table3.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
    del temp
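On newer pandas (2.0 and later) DataFrame.append has been removed, so a variant of the same loop, collecting the pieces in a list and concatenating once at the end, would look roughly like this:

pieces = []
for _, group in table1.groupby('scenario'):
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # keep only the max-'calc' row per letter within this scenario
    pieces.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
table3 = pd.concat(pieces, ignore_index=True)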

Dynamically handling data columns in csv for import to Postgresql

I'm new to Python (3) and having a hard time finding relevant examples of how to handle the following scenario. I know this is on the verge of being a "what's best" question, but hopefully there is a clearly appropriate methodology for this.
I have csv data files that contain timestamps and then at least one column of data with a name defined by a master list (i.e. all possible column headers are known). For example:
File1.csv
date-time, data a, data b
2014-01-01, 23, 22
2014-01-01, 23, 22d
File2.csv
date-time, data d, data a
2014-01-01, 99, 20
2014-01-01, 100, 22
I've been going in circles trying to understand when to use tuples, lists, and dictionaries for this type of scenario for import into PostgreSQL. Since the column order can change and the list of columns is different each time (although always from a master set), I'm not sure how best to generate a data set that includes the timestamp and columns and then perform an insert into a PostgreSQL table where unspecified columns are given a value.
Given the dynamic nature of the columns' presence and the need to maintain the relationship with the timestamp for the Postgresql import via psycopg, what is recommended? Lists, lists of lists, dictionaries, or tuples?
I'm not begging for specific code, just some guidance. Thanks.
You can use the csv module to parse the input file, and from its first row you can build (prepare) a psycopg INSERT statement with the column names and %s placeholders instead of values. For the rest of the rows, simply execute this statement with the row as the values:
import csv
import psycopg2

connect_string = 'dbname=test host=localhost port=5493 user=postgres password=postgres'
connection = psycopg2.connect(connect_string)
cursor = connection.cursor()

f = open(fn, 'rt')
try:
    reader = csv.reader(f)
    cols = []
    for row in reader:
        if not cols:
            # first row: build the INSERT statement from the header
            cols = row
            psycopg_marks = ','.join(['%s' for s in cols])
            insert_statement = "INSERT INTO xyz (%s) VALUES (%s)" % (','.join(cols), psycopg_marks)
            print(insert_statement)
        else:
            print(row)
            cursor.execute(insert_statement, row)
finally:
    f.close()
...
For your example files you will have to correct the column names: headers like "date-time" or "data a" are not valid unquoted PostgreSQL identifiers.
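If you would rather keep the original headers, a sketch of one option, assuming psycopg2 >= 2.7 (which provides the psycopg2.sql module), is to build the statement with quoted identifiers:

from psycopg2 import sql

# quote each header as an identifier so names like "date-time" or "data a" are accepted
insert_statement = sql.SQL("INSERT INTO xyz ({}) VALUES ({})").format(
    sql.SQL(', ').join(sql.Identifier(c.strip()) for c in cols),
    sql.SQL(', ').join(sql.Placeholder() for _ in cols),
)
cursor.execute(insert_statement, row)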
