django-tables2 doesn't sort columns properly - python

In the project, I have used RequestConfig(request).configure(table) to apply sorting across the columns. All of the columns are defined as ArrayField(models.CharField(max_length=50), null=True). The problem is that Title and Year can be sorted, but the other three cannot be sorted properly: I get larger values mixed in among the smaller ones. I suppose Title and Year are strings, but the others are lists containing integers. Does anyone have a lead on how to properly sort the three columns in the middle?

[Image: table headers]
[Image: second column not sorted properly]

This appears to be the result of sorting integers that are stored as strings. You have pointed out that the columns of your table are defined using a CharField model, which, according to the Django docs, is for storing strings.
You should probably define each of the CoAuthors, Citations, and Ncitations columns using IntegerField instead.
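The "larger values among the smaller ones" symptom is ordinary lexicographic string comparison; a minimal illustration in plain Python of why storing integers as strings misorders them:

```python
# Strings sort character by character, so "10" < "2"; integers sort numerically.
values = ["2", "10", "1"]
print(sorted(values))            # lexicographic order of the strings
print(sorted(values, key=int))   # numeric order after conversion
```

Storing the values as integers in the database gives the ORDER BY clause the same numeric behavior as the `key=int` variant.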

Related

How to use DataFrame.isin without the constraint of having to match both index and value?

So, I have two files: one with 6 million entries and the other with around 5 million. I want to compare the values of a particular column in both dataframes. This is the code that I have used:
print(df1['Col1'].isin(df2['col3']).value_counts())
This is essential for me as I want to see the number of True (same) and False (different) entries. Around 95% of the entries come back as True, but some 5% come back as False. I extracted this data using to_csv and compared the columns using vimdiff, and they are all identical, so why is the code labelling them as False (different)? Is there a better and more foolproof method?
Note: I have checked for whitespace in the columns as well. There is no whitespace.
PS: The pandas isin documentation states that both index and value have to match. Since I have more entries in one file, the index does not match for those entries; how do I remove that constraint?
First, convert the column you use as the parameter inside your isin() method to a list.
Then use the resulting mask to filter your df1 dataframe, so that you take the value counts on the same column you filtered.
From your example:
print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
Try running that again.
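A toy version of the same pattern; the column names Col1/col3 match the question, and the values are invented for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col3': ['b', 'd', 'e']})

# Series.isin compares values only; converting the other column to a plain
# list sidesteps any worry about index alignment.
mask = df1['Col1'].isin(df2['col3'].tolist())
print(mask.value_counts())            # how many True (same) / False (different)

# Keep only the matching rows and count each value on the filtered column.
matches = df1[mask]['Col1'].value_counts()
print(matches)
```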

How to merge rows using pandas if they have duplicate values?

I have a specific case with my data which I’m unable to find an answer to in any documentation or on stack.
What I’m trying to do is merge duplicates based on the ‘MPN’ Column (and not the Vehicle column).
There are going to be duplicates of MPNs in lots of rows as shown in the first image.
I obviously want to remove duplicate rows which have the same MPN, but MERGE the Category values from the three rows shown in Image 1 into one cell, separated by colons, as shown in Image 2, which would be my desired result after coding.
What I'm asking for: to be able to merge and remove duplicates based on rows that contain a duplicate MPN, combining them into ONE row while retaining the categories separated by a colon.
Look at my before and after images to understand more clearly.
I’m also using Python 3.7 to code this from a csv file, separated by commas.
Before: [image]
After duplicates have merged: [image]
How do I solve the problem?
Assuming df holds your csv data.
First, group on the MPN column and replace each row's Category with a colon-separated string of all the Category values in its group:
df['Category'] = df.groupby('MPN')['Category'].transform(lambda x: ':'.join(x))
Then remove the duplicates (note that drop_duplicates returns a new frame rather than modifying df in place):
df = df.drop_duplicates(subset='MPN')
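Put together, the two steps look like this on a toy frame; the column names come from the question, and the row values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    'MPN': ['A1', 'A1', 'A1', 'B2'],
    'Vehicle': ['Ford', 'Ford', 'Ford', 'BMW'],
    'Category': ['Brakes', 'Pads', 'Discs', 'Filters'],
})

# Collapse each MPN group's categories into one colon-separated cell...
df['Category'] = df.groupby('MPN')['Category'].transform(':'.join)
# ...then keep a single row per MPN.
df = df.drop_duplicates(subset='MPN')
print(df)
```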

Mutable indexed heterogeneous data structure?

Is there a data class or type in Python that matches these criteria?
I am trying to build an object that looks something like this:
ExperimentData
  ID 1
    sample_info_1: character string
    sample_info_2: character string
    Dataframe_1: pandas data frame
    Dataframe_2: pandas data frame
  ID 2
    (etc.)
Right now, I am using a dict to hold the object ('ExperimentData'), which contains a namedtuple for each ID. Each namedtuple has a named field for the corresponding data attached to the sample. This allows me to keep all the IDs indexed, and have all of the fields under each ID indexed as well.
However, I need to update and/or replace the entries under each ID during downstream analysis. Since a tuple is immutable, this does not seem to be possible.
Is there a better implementation of this?
You could use a dict of dicts instead of a dict of namedtuples. Dicts are mutable, so you'll be able to modify the inner dicts.
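A minimal sketch of the dict-of-dicts layout, with the field names from the question and invented sample values:

```python
import pandas as pd

experiment_data = {
    'ID 1': {
        'sample_info_1': 'control group',
        'sample_info_2': 'batch 7',
        'dataframe_1': pd.DataFrame({'x': [1, 2]}),
        'dataframe_2': pd.DataFrame({'y': [3, 4]}),
    },
}

# Unlike a namedtuple, the inner dict can be updated in place during
# downstream analysis:
experiment_data['ID 1']['sample_info_1'] = 'treatment group'
```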
Given what you said in the comments about the structures of each DataFrame-1 and -2 being comparable, you could also group all of each into one big DataFrame, by adding a column to each DataFrame containing the value of sample_info_1 repeated across all rows, and likewise for sample_info_2. Then you could concat all the DataFrame-1s into a big one, and likewise for the DataFrame-2s, getting all your data into two DataFrames. (Depending on the structure of those DataFrames, you could even join them into one.)
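The one-big-DataFrame approach might look like this; the IDs, metadata values, and measurements are invented for illustration:

```python
import pandas as pd

# Two per-sample DataFrames with comparable columns.
df_id1 = pd.DataFrame({'measurement': [1.0, 2.0]})
df_id2 = pd.DataFrame({'measurement': [3.0, 4.0]})

# Tag each frame with its sample metadata as a repeated column...
df_id1['sample_info_1'] = 'control'
df_id2['sample_info_1'] = 'treated'

# ...then stack them into one DataFrame keyed by ID.
big = pd.concat({'ID 1': df_id1, 'ID 2': df_id2})
print(big)
```

Passing a dict to pd.concat puts the IDs on the outer level of a MultiIndex, so each sample's rows stay addressable via big.loc['ID 1'].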

appending non-unique rows to another database using python

Hey all,
I have two databases. One has 145000 rows and approx. 12 columns; the other has around 40000 rows and 5 columns. I am trying to compare them based on the values of two columns. For example, if in CSV #1 column one says 100-199 and column two says Main St (meaning that this row covers the 100 block of Main Street), how would I go about comparing that with a similar pair of columns in CSV #2?
I need to compare every row in CSV #1 to each single row in CSV #2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV #2. Thus CSV #2's number of columns will grow significantly and it will have repeat entries; it doesn't matter how the columns are ordered.
Any advice on how to compare two columns with another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
1. A csv file is NOT a database. A csv file is just rows of text-chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types, table joins, indexes, row iteration, proper handling of multiple matches, and many other things which you really don't want to rewrite from scratch.
2. How is it supposed to know that Address("100-199") == Address("Main Street")? You will have to come up with some sort of knowledge-base which transforms each bit of text into a canonical address or address-range which you can then compare; see Where is a good Address Parser, but be aware that it deals with singular addresses (not address ranges).
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need to consider point #2 above.
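If you stay with plain Python and the csv module, one workable pattern is to index the larger file by its (block, street) pair once, then stream the other file against that index. Everything here is invented for illustration: the column names ('block', 'street', etc.) and the file contents are assumptions, and real files would be opened with open() rather than io.StringIO:

```python
import csv
import io

# Stand-ins for the two csv files (illustrative headers and rows).
csv1 = io.StringIO("block,street,owner\n100-199,Main St,Smith\n")
csv2 = io.StringIO("block,street,incident\n100-199,Main St,theft\n200-299,Main St,noise\n")

# Index CSV #1 rows by their (block, street) pair, so each CSV #2 row is a
# dict lookup instead of a scan over all 145000 rows.
index = {}
for row in csv.DictReader(csv1):
    index.setdefault((row['block'], row['street']), []).append(row)

# Stream CSV #2 and emit one output row per match, with CSV #1's columns
# appended (prefixed here to avoid name clashes).
merged = []
for row in csv.DictReader(csv2):
    for match in index.get((row['block'], row['street']), []):
        merged.append({**row, **{'csv1_' + k: v for k, v in match.items()}})

print(merged)
```

This only handles exact matches on the two key columns; fuzzy or range-based address matching still runs into point #2 above.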

Change dtype of a single column in a 2d numpy array

I am creating a 2d array full of zeros with the following line of code:
MyNewArray=zeros([4,12],float)
However, the first column will need to be populated with string-type textual data, while all the other columns will need to be populated with numerical data that can be manipulated mathematically.
How can I edit the code above so that the first column in the matrix can be of the string data type while keeping all the other columns as float?
You might want to use structured arrays
MyNewArray = zeros(12, dtype='S10,f4,f4,f4')
There are several ways of defining the structure; here I have defined 4 fields: one text field of 10 characters, and three floats (you could write float instead of f4).
It is important to note that the number of characters of the text field has to be specified, for array memory management reasons. You won't be able to store strings longer than this maximum length.
Each field is referenced by a field name; in this case, the default field names f0 to f3 will be used. For example, to get the whole first column (the textual one):
MyNewArray['f0']
Of course, you can modify the field names as you wish.
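A small self-contained sketch of the same idea with explicit field names; 'label', 'a', 'b', and 'c' are chosen here for illustration (the default names would be f0 through f3):

```python
import numpy as np

# One 10-byte string field plus three float32 fields per record.
dtype = np.dtype([('label', 'S10'), ('a', 'f4'), ('b', 'f4'), ('c', 'f4')])
arr = np.zeros(4, dtype=dtype)

# The textual column and the numeric columns coexist in one array:
arr['label'] = [b'row1', b'row2', b'row3', b'row4']
arr['a'] = [1.5, 2.5, 3.5, 4.5]

print(arr['label'])
print(arr['a'] * 2)   # numeric fields still support ordinary arithmetic
```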
