I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C as shown above into a Category column and 1, 5, 18 into a Values/Series column, for updating my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do this? Currently the example above is also extracted as a string, so I believe I have to convert it to a list first?
Thanks in advance!
Try parsing your string (which represents a list of lists), then create your DataFrame from the real list:
import pandas as pd
import re

s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
# strip the outer brackets, capture each bracketed pair, then split on the comma
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
  Category Values
0        A      1
1        B      5
2        C     18
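Note that after the split both columns hold strings; if the chart needs numbers, a quick follow-up conversion (a minimal sketch, assuming the values are all integers) is:
df['Values'] = df['Values'].astype(int)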
You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns=['Category', 'Value'])
where d is your list of tuples.
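For example, a minimal sketch (assuming d has already been parsed into a real Python list of tuples rather than a string):
import pandas as pd

d = [('A', 1), ('B', 5), ('C', 18)]
df = pd.DataFrame(data=d, columns=['Category', 'Value'])
print(df)  # one row per tuple, with Category and Value columns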
from prettytable import PrettyTable

column = [["A", 1], ["B", 5], ["C", 18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])
    columnvalue.append(data[1])
    t.add_row([data[0], data[1]])
print(t)
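Since the end goal is updating a python-pptx chart, the two lists built above can feed new chart data directly. A minimal sketch, assuming chart is an existing chart object you already grabbed from a slide (that variable is hypothetical here):
from pptx.chart.data import CategoryChartData

chart_data = CategoryChartData()
chart_data.categories = columnname              # ['A', 'B', 'C']
chart_data.add_series('Series 1', columnvalue)  # [1, 5, 18]
chart.replace_data(chart_data)                  # chart: a chart object from your slide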
I have a data file with column names like this (numbers in the name from 1 to 32):
inlet_left_cell-10<stl-unit=m>-imprint)
inlet_left_cell-11<stl-unit=m>-imprint)
inlet_left_cell-12<stl-unit=m>-imprint)
...
inlet_left_cell-9<stl-unit=m>-imprint)
(rows of data under each column)
I would like to sort the columns (together with their data) from left to right in Python, based on the number in the column name; each whole column needs to move to its sorted position. So the desired order is xxx-1xxx, xxx-2xxx, xxx-3xxx, ..., xxx-32xxx:
inlet_left_cell-1<stl-unit=m>-imprint)
inlet_left_cell-2<stl-unit=m>-imprint)
inlet_left_cell-3<stl-unit=m>-imprint)
...
inlet_left_cell-32<stl-unit=m>-imprint)
(rows of data under each column)
Is there any way to do this in Python? Thanks.
Here is a solution:
import numpy as np
import pandas as pd

# Some random data
data = np.random.randint(1, 10, size=(100, 32))

# Set up column names as given in the problem, randomly ordered
columns = [f'inlet_left_cell-{x}<stl-unit=m>-imprint)' for x in range(1, 33)]
np.random.shuffle(columns)

# Create the dataframe
df = pd.DataFrame(data, columns=columns)
df.head()

# Sort the columns into the required order: pull the number out of each
# name (between the '-' and the '<'), map it back to the full column
# name, then select the columns in numeric order
col_nums = [int(x.split('-')[1].split('<')[0]) for x in df.columns]
column_map = dict(zip(col_nums, df.columns))
df = df[[column_map[i] for i in range(1, 33)]]
df.head()
There are many ways to do it; I'm just posting a simple one: extract the column names and sort them using natsort. Assuming the DataFrame is named df:
from natsort import natsorted, ns

dfl = list(df)                           # column names as a list
dfl = natsorted(dfl, alg=ns.IGNORECASE)  # natural sort on the embedded numbers
df_sorted = df[dfl]                      # rearrange the DataFrame
print(df_sorted)
If the column names differ only by this number, try a plain sort. (Note that a lexicographic sort only gives the numeric order when the numbers all have the same digit count, as in the example below; for 1-32 you would need a numeric key, as sketched after the example.)
import pandas as pd
data = pd.read_excel("D:\\..\\file_name.xlsx")
data = data.reindex(sorted(data.columns), axis=1)
For example:
data = pd.DataFrame(columns=["inlet_left_cell-23<stl-unit=m>-imprint)", "inlet_left_cell-47<stl-unit=m>-imprint)", "inlet_left_cell-10<stl-unit=m>-imprint)", "inlet_left_cell-12<stl-unit=m>-imprint)"])
print(data)
Empty DataFrame
Columns: [inlet_left_cell-23<stl-unit=m>-imprint), inlet_left_cell-47<stl-unit=m>-imprint), inlet_left_cell-10<stl-unit=m>-imprint), inlet_left_cell-12<stl-unit=m>-imprint)]
Index: []
After this:
data = data.reindex(sorted(data.columns), axis=1)
print(data)
Empty DataFrame
Columns: [inlet_left_cell-10<stl-unit=m>-imprint), inlet_left_cell-12<stl-unit=m>-imprint), inlet_left_cell-23<stl-unit=m>-imprint), inlet_left_cell-47<stl-unit=m>-imprint)]
Index: []
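When the numbers vary in digit count (1 through 32), a sketch of a numeric sort key (assuming the number always sits between the '-' and the '<' as in these names):
import re

data = data.reindex(sorted(data.columns, key=lambda c: int(re.search(r'-(\d+)<', c).group(1))), axis=1)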
Problem:
Executing this code:
import pandas as pd
data = {"1","2","3","4","5"}
index = ["1_i","2_i","3_i","4_i","5_i"]
df = pd.DataFrame(data,index=index)
print(df)
Results in this output:
0
1_i 4
2_i 3
3_i 5
4_i 1
5_i 2
Question:
Why aren't the values in order according to the index I set them to?
1 should be set to the index 1_i, 2 should be set to the index 2_i, etc.
The problem is that you are making the dataframe from a set, which is arbitrarily ordered. Try making your dataframe from a container that maintains order, like a list or tuple.
import pandas as pd
data = ["1","2","3","4","5"] # a list
index = ["1_i","2_i","3_i","4_i","5_i"]
df = pd.DataFrame(data, index=index)
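With a list, the values line up with the index in the order given:
print(df)
#      0
# 1_i  1
# 2_i  2
# 3_i  3
# 4_i  4
# 5_i  5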
BACKGROUND:
I have two columns, 'address' and 'raw_data'. (The examples here are just samples I made up; the original dataset is over 6M rows and in a different language.)
Problem:
I need to find all the rows where 'address' and 'raw_data' do not match, meaning some mistakes were made when logging the data from 'address' into 'raw_data'.
I'm fairly new to Pandas. My plan is to separate the 'raw_data' column by comma, then compare the newly produced columns with the original 'address' column (to see whether the 'address' column has that info; if not, that means a mistake was made?).
Like I said, I'm new to pandas, and this is what I have so far.
import pandas as pd
columns = ['address', 'raw_data']
df=pd.read_csv('address.csv', usecols=columns)
df = pd.concat([df['address'], df['raw_data'].str.split(',', expand=True)], axis=1)
Now the new columns have info like this: "CITY":"ATLANTA". I want the columns to have just ATLANTA, without the colons and 'CITY', in order to compare the info with the 'address' column.
How should I go on about it?
Also, at this point of my pandas learning experience, I do not yet know how to compare two columns. Could someone help a newbie out please? Thanks a lot!
PS: by comparison of two columns I meant to check whether one column has the characters in the second column, not to check whether the two columns are equal. Just want to point that out.
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2], [3, 6], [1, 1]], columns=["col1", "col2"])
comparison_column = np.where(df["col1"] == df["col2"], True, False)
df["equal"] = comparison_column
   col1  col2  equal
0     2     2   True
1     3     6  False
2     1     1   True
I will use this data:
import numpy as np
import pandas as pd
j = {"address":"foo","b": "bar"}
j2 = {"address":"foo2","b": "bar2"}
values = [["foo", j], ["bar", j2]]
df = pd.DataFrame(data=values, columns=["address", "raw_data"])
df
address raw_data
0 foo {'address': 'foo', 'b': 'bar'}
1 bar {'address': 'foo2', 'b': 'bar2'}
I will expand the raw_data dicts into their own columns (with .values.tolist()) in another DataFrame (df2):
df2 = pd.DataFrame(df['raw_data'].values.tolist())
df2
address b
0 foo bar
1 foo2 bar2
To compare you use:
df.address == df2.address
0 True
1 False
If you need to save this in the original df, you can add a column:
df["result"] = df.address == df2.address
Since each raw_data entry is already a dict, you don't need to split on commas at all; just access its keys. You can map custom functions to columns with the apply function; in this case, define a function that accesses the keys of the dictionary and extracts the values.
df['address_raw'] = df['raw_data'].apply(lambda x: x['address'])
df['city_raw'] = df['raw_data'].apply(lambda x: x['CITY'])
df['addrline2_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_2'])
df['addrline3_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_3'])
df['utmnorthing_raw'] = df['raw_data'].apply(lambda x: x['UTM_NORTHING'])
These lines will create columns of each field in the dict and then you can just compare the ones like:
df['address'] == df['address_raw']
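Per the PS about containment rather than strict equality, a minimal row-wise sketch (assuming both values are strings; the match column name is hypothetical):
df['match'] = df.apply(lambda row: str(row['address_raw']) in str(row['address']), axis=1)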
I have a database in which many columns share the same name. The data frame is generated when I load the data in Python as per the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
The resulting data frame contains duplicated columns. Is there any way to avoid reading duplicate columns in Pandas, or to remove the duplicate columns after reading?
Please note: the column names are different once the data is read into Pandas (duplicates get renamed on read), so a command like df = df.loc[:, ~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns, all dates.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# read once to work out the required columns:
# column 0, then every second column after it
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))

# read again, keeping only those columns
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
One way to do it is to read only the header row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand, so it should be failsafe:
m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)
Full example:
import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# read just the header row, transpose so the names become rows,
# drop the duplicated names, and keep the surviving positions
m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)
print(df)

#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
Another way would be to remove all columns with a dot in the name (pandas appends .1, .2, ... to duplicated column names when reading). This should work in most cases, as the dot is rarely used in original column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(io.StringIO(data))
# the second Date column was read as 'Date.1'; drop anything containing a dot
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)

#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as the output. Below is the code I'm working with.
My document consists of:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd

input_file = "C:\\....\\consumer_complaints.csv"
df = pd.read_csv(input_file)  # read_csv already returns a DataFrame

cols = [1, 2, 3, 4]
df = df[df.columns[cols]]
Specify the column numbers you want to select in cols; DataFrame columns are numbered from index 0.
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'B':'F']  # slices by column label, so this assumes columns literally named 'B' through 'F'
Hope that helps.
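With the actual column names from this question, the equivalent label slice (relying on the column order listed above) would be:
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']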
This worked for me, using positional slicing with iloc (note that plain df[n1:n2] slices rows, not columns):
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are column positions and the end of the slice is exclusive, e.g. if you want columns 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though not sure how to select a discontinuous range of columns (one option is sketched below).
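For a discontinuous set of columns, one option is NumPy's index-building helper (a sketch; the chosen positions are just examples):
import numpy as np

df1 = df.iloc[:, np.r_[0, 3:5]]  # column 0 plus columns 3 and 4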
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
This spits out the top 3 rows of columns 2 and 3 (remember, numbering starts at 0).