Pandas Python: Delete row values by name

I have a csv list of keywords in this format:
75410,Sportart
75419,Ballsport
75428,Basketball
76207,Atomenergie
76212,Atomkraftwerk
76223,Wiederaufarbeitung
76225,Atomlager
67869,Werbewirtschaft
I read the values using pandas and create a table in this format:
DF: name
id
75410 Sportart
75419 Ballsport
75428 Basketball
76207 Atomenergie
76212 Atomkraftwerk
... ...
251450 Tag und Nacht
241473 Kollektivverhalten
270930 Indigene Völker
261949 Wirtschaft und Politik
282512 Impfen
Using the name, I want to delete the whole row, e.g. 'Sportart' deletes first row.
I want to check this against values from my wordList array; I store them as strings in a list.
What did I miss? Using the code below I receive an '(value) not in axis' error.
df = pd.read_csv("labels.csv", header=None, index_col=0)
df.index.name = "id"
df.columns = ["name"]
print('DF: ',df)
df.drop(labels=wordList, axis=0,inplace=True)
pd_frame = pd.DataFrame(df)
cleaned_pd_frame = pd_frame.query('name != {}'.format(wordList))
I succeeded in hiding them with query(), but I want to remove them entirely.

You can use a helper function, index_to_drop below, to take in a name and filter its index out:
index_to_drop = lambda name: df.index[df['name']==name]
Then you can drop "Sportart" like:
df.drop(index_to_drop('Sportart'), inplace=True)
print(df)
Output:
        name
id
75419   Ballsport
75428   Basketball
76207   Atomenergie
76212   Atomkraftwerk
251450  Tag und Nacht
241473  Kollektivverhalten
270930  Indigene Völker
261949  Wirtschaft und Politik
282512  Impfen
That being said, this is just a convoluted way to drop a row. The same outcome can be obtained much more simply with a boolean mask (or isin, for a list of names):
df = df[df['name']!='Sportart']
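For the wordList case in the question, isin drops every listed name in one pass. A minimal sketch, with made-up rows shaped like labels.csv:

```python
import pandas as pd

# Sample frame shaped like the question's labels.csv (id index, name column)
df = pd.DataFrame(
    {"name": ["Sportart", "Ballsport", "Basketball", "Atomenergie"]},
    index=pd.Index([75410, 75419, 75428, 76207], name="id"),
)

wordList = ["Sportart", "Atomenergie"]  # names whose rows should go

# Keep only rows whose name is NOT in wordList
df = df[~df["name"].isin(wordList)]
```

Unlike df.drop(labels=wordList), this never raises "not in axis", because it filters by column value rather than by index label.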

Related

Filter csv table to have just 2 columns. Python pandas

I have a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need that for my anomaly detection code so I don't have to manually delete columns and so on, at least not all of them. I can't do it in the program that works with the machine that collects the wattage info.
I tried this, but it doesn't work well enough:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear') # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the usecols keyword argument of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire CSV and only then select a subset of columns, the pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy,
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f,header=0,usecols=['_time','_value'] )
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71
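As an end-to-end sketch (with a tiny inline sample standing in for the real export file), usecols can be combined with parse_dates so the two columns arrive already typed:

```python
import io
import pandas as pd

# Made-up two-row sample in the same shape as the InfluxDB export
csv_text = (
    "result,table,_start,_stop,_time,_value,_field,_measurement,device\n"
    ",0,s,s,2022-10-24T12:12:35Z,44.61,power,shellies,dev\n"
    ",0,s,s,2022-10-24T12:12:40Z,17.33,power,shellies,dev\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["_time", "_value"],  # read only the two needed columns
    parse_dates=["_time"],        # parse ISO timestamps on the way in
)
```

With the real file, replace io.StringIO(csv_text) by the path to the .csv; the skipped columns are never loaded into memory.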

In Python, if there is a duplicate, use the date column to choose which duplicate to use

I have code that runs 16 test cases against a CSV, checking for anomalies from poor data entry. A new column, 'Test case failed,' is created. A number corresponding to which test it failed is added to this column when a row fails a test. These failed rows are separated from the passed rows; then, they are sent back to be corrected before they are uploaded into a database.
There are duplicates in my data, and I would like to add code to check for duplicates, then decide what field to use based on the date, selecting the most updated fields.
Here is my data with two duplicate IDs, with the first row having the most recent Address while the second row has the most recent name.
ID   MnLast  MnFist  MnDead?  MnInactive?  SpLast  SpFirst  SPInactive?  SpDead  Addee         Sal       Address    NameChanged  AddrChange
123  Doe     John    No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  123 place  05/01/2022   11/22/2022
123  Doe     Dan     No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  789 road   11/01/2022   05/06/2022
Here is a snippet of my code showing the 5th testcase, which checks for the following: Record has Name information, Spouse has name information, no one is marked deceased, but Addressee or salutation doesn't have "&" or "AND." Addressee or salutation needs to be corrected; this record is married.
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/file.csv", encoding='latin-1' )
# Create array to store which test number the row failed
data['Test Case Failed']= ''
data = data.replace(np.nan,'',regex=True)
data.insert(0, 'ID', range(0, len(data)))
# There are several test cases, but they function primarily the same
# Testcase 1
# Testcase 2
# Testcase 3
# Testcase 4
# Testcase 5 - comparing strings in columns
df = data[((data['FirstName'] != '') & (data['LastName'] != '')) &
          ((data['SRFirstName'] != '') & (data['SRLastName'] != '') &
           (data['SRDeceased'].str.contains('Yes') == False) &
           (data['Deceased'].str.contains('Yes') == False))]
df1 = df[df['PrimAddText'].str.contains("AND|&") == False]
data_5 = df1[df1['PrimSalText'].str.contains("AND|&") == False]
ids = data_5.index.tolist()
# Assign 5 for each failed
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 5'
# Failed if column 'Test Case Failed' is not empty, Passed if empty
failed = data[(data['Test Case Failed'] != '')]
passed = data[(data['Test Case Failed'] == '')]
failed['Test Case Failed'] = failed['Test Case Failed'].str[1:]
failed = failed[(failed['Test Case Failed'] != '')]
# Clean up
del failed["ID"]
del passed["ID"]
failed['Test Case Failed'].value_counts()
# Print to console
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# output two files
failed.to_csv("C:/Users/Failed.csv", index = False)
passed.to_csv("C:/Users/Passed.csv", index = False)
What is the best approach to check for duplicates, choose the most updated fields, drop the outdated fields/row, and perform my test?
First, try to set a mapping that associates update date columns to their corresponding value columns.
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"], ...}
Then, transform date columns into datetime format to be able to compare them (using argmax later).
for key in date2val.keys():
    failed[key] = pd.to_datetime(failed[key])
Then, group the duplicates by ID (since ID decides whether a row is a duplicate). For each date column, take the maximum value in the group (the most recent update) and look up the columns to update in the initial mapping. I'll update the last row of each group and use it as the final result (by appending it to the corrected list).
corrected = list()
for _, grp in failed.groupby("ID"):
    last = grp.iloc[-1].copy()  # work on a copy to avoid chained-assignment issues
    for key in date2val.keys():
        recent = grp[key].argmax()  # positional index of the most recent date
        for col in date2val[key]:
            last[col] = grp.iloc[recent][col]
    corrected.append(last)
corrected = pd.DataFrame(corrected)
corrected = pd.DataFrame(corrected)
Preparing data:
import pandas as pd
c = 'ID MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead Addee Sal Address NameChanged AddrChange'.split()
data1 = '123 Doe John No No Doe Jane No No Mr.JohnDoe Mr.John 123place 05/01/2022 11/22/2022'.split()
data2 = '123 Doe Dan No No Doe Jane No No Mr.JohnDoe Mr.John 789road 11/01/2022 05/06/2022'.split()
data3 = '8888 Brown Peter No No Brwon Peter No No Mr.PeterBrown M.Peter 666Avenue 01/01/2011 01/01/2011'.split()
df = pd.DataFrame(columns = c, data = [data1, data2, data3])
df.AddrChange = pd.to_datetime(df.AddrChange)   # astype alone doesn't assign back
df.NameChanged = pd.to_datetime(df.NameChanged)
df
The DataFrame matches the example. Then you take a piece of the dataframe, avoiding changes to the original. Adjacent rows share the same ID and, after the descending sort, the first one holds the most recent name:
df1 = df[['ID', 'MnFist', 'NameChanged']].sort_values(by=['ID', 'NameChanged'], ascending = False)
df1
Then you build a dictionary with df.ID as the key and the appropriate name as its value; the goal is to rebuild the whole MnFist column:
d = {}
for id in set(df.ID.values):
    df_mask = df1.ID == id  # filter only rows with the same id
    filtered_df = df1[df_mask]
    if len(filtered_df) <= 1:
        d[id] = filtered_df.iat[0, 1]  # id has only one row, so no changes
        continue
    for name in filtered_df.MnFist:
        if name in ['unknown', '', ' '] or name is None:  # discard unserviceable names
            continue
        else:
            d[id] = name  # found a serviceable name
    if id not in d.keys():
        d[id] = filtered_df.iat[0, 1]  # no serviceable name, so pick the first
print(d)
The partial output of the dictionary is:
{'8888': 'Peter', '123': 'Dan'}
Then you rebuild the whole column:
df.MnFist = [d[id] for id in df.ID]
df
The partial output is:
Then the same procedure to the other column:
df1 = df[['ID', 'Address', 'AddrChange']].sort_values(by=['ID', 'AddrChange'], ascending = False)
df1
d = { id: df1.loc[df1.ID == id, 'Address'].values[0] for id in set(df.ID.values) }
d
df.Address = [d[id] for id in df.ID]
df
The final output is:
Edited after the author commented on the possibility of unknown or unserviceable data.
Let me restate what I understood from the question:
You have a dataset on which you are doing several sanity checks. (Looks like you already have everything in place for this step)
In the next step you are finding duplicate rows with different columns updated at different dates. (I assume you already have this.)
Now, you are looking for a new dataset that has non-duplicated rows with updated fields using the latest date entries.
First, define different dates and their related columns in a form of dictionary:
date_to_cols = {"AddrChange": "Address", "NameChanged": ["MnLast", "MnFirst"]}
Next, apply group by using "ID" and then get the index for maximum value of different dates. Once we have the index, we can pull the related fields for that date from the data.
data[list(date_to_cols.keys())] = data[list(date_to_cols.keys())].astype('datetime64')
latest_data = data.groupby('ID')[list(date_to_cols.keys())].idxmax().reset_index()
for date_field, cols_to_update in date_to_cols.items():
    latest_data[cols_to_update] = latest_data[date_field].apply(lambda x: data.iloc[x][cols_to_update])
    latest_data[date_field] = latest_data[date_field].apply(lambda x: data.iloc[x][date_field])
Next, you can merge these latest_data with the original data (after removing old columns):
cols_to_drop = list(latest_data.columns)
cols_to_drop.remove("ID")
data.drop(columns= cols_to_drop, inplace=True)
latest_data_all_fields = data.merge(latest_data, on="ID", how="left")
latest_data_all_fields.drop_duplicates(inplace=True)
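The idxmax idea above can also be condensed into a runnable sketch. The frame and mapping below are made-up stand-ins for the real columns, showing how each value column gets pulled from the row whose date column is newest:

```python
import pandas as pd

# Hypothetical two-row duplicate, shaped like the question's data
df = pd.DataFrame({
    "ID": [123, 123],
    "MnFist": ["John", "Dan"],
    "Address": ["123 place", "789 road"],
    "NameChanged": ["05/01/2022", "11/01/2022"],
    "AddrChange": ["11/22/2022", "05/06/2022"],
})

date_to_cols = {"AddrChange": ["Address"], "NameChanged": ["MnFist"]}

for dcol in date_to_cols:
    df[dcol] = pd.to_datetime(df[dcol])

# Start from one row per ID, then overwrite each value column
# with the value taken from the row whose date column is newest
out = df.groupby("ID").last()
for dcol, vcols in date_to_cols.items():
    newest = df.groupby("ID")[dcol].idxmax()  # row label of latest date per ID
    for vcol in vcols:
        out[vcol] = df.loc[newest, vcol].values
    out[dcol] = df.loc[newest, dcol].values
```

The result keeps the most recent name (Dan) and the most recent address (123 place), matching the expected merge of the two duplicate rows.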

Pandas reading tall data into a DataFrame

I have a text file which consists of tall data. I want to iterate through each line within the text file and create a Dataframe.
The text file looks like this. Note that the same fields don't exist for all users (e.g. some might have an email field, some might not). Also note that each user is separated by [User]:
[User]
Field=Data
employeeNo=123
last_name=Toole
first_name=Michael
language=english
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
My issue is as follows: my code iterates through the data, but for any field that is not present for a user it moves down the list and takes the next piece of data relating to that field.
For example, the first user does not have an email associated with him, so the code assigns him the email of the second user in the list. Instead, I want to return NaN/N/A/blank if no information is available.
## Import Libraries
import pandas as pd
import numpy as np
from pandas import DataFrame
## Import Data
## Set column names so that no lines in the text file are missed
col_names = ['Field', 'Data']
## If you have been sent this script you need to change the file path below, change it to where you have the .txt file saved
textFile = pd.read_csv(r'Desktop\SampleData.txt', delimiter="=", engine='python', names=col_names)
## Get a list of the unique IDs
new_cols = pd.unique(textFile['Field'])
userListing_DF = pd.DataFrame()
## Create a for loop to iterate through the first column and get the unique columns, then concatenate those unique values with data
for col in new_cols:
    tmp = textFile[textFile['Field'] == col]
    tmp.reset_index(inplace=True)
    userListing_DF = pd.concat([userListing_DF, tmp['Data']], axis=1)
userListing_DF.columns = new_cols
Read in the single long column, and then form a group indicator by seeing where the value is '[User]'. Then separate the column labels and values, with a str.split and join back to your DataFrame. Finally pivot to your desired shape.
df = pd.read_csv('test.txt', sep='\n', header=None)
df['Group'] = df[0].eq('[User]').cumsum()
df = df[df[0].ne('[User]')] # No longer need these rows
df = pd.concat([df, df[0].str.split('=', expand=True).rename(columns={0: 'col', 1: 'val'})],
axis=1)
df = df.pivot(index='Group', columns='col', values='val').rename_axis(columns=None)
Field Location department email employeeNo first_name language last_name role
Group
1 Data NaN Marketing NaN 123 Michael english Toole Marketing Lead
2 NaN Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN NaN damian.lee#email.com 998 Damian english Lee NaN
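If you'd rather avoid the pivot, the same file can be parsed with a plain loop that starts a fresh dict at each [User] marker and lets the DataFrame constructor fill missing fields with NaN. A sketch with made-up users:

```python
import pandas as pd

# A made-up two-user sample in the same [User] format
text = """[User]
employeeNo=123
last_name=Toole
first_name=Michael
[User]
employeeNo=456
first_name=Juan
email=juan.ronaldo#sms.ie
"""

records, current = [], None
for line in text.splitlines():
    line = line.strip()
    if line == "[User]":
        if current:            # close out the previous user's block
            records.append(current)
        current = {}
    elif "=" in line and current is not None:
        key, _, value = line.partition("=")
        current[key.strip()] = value.strip()
if current:                    # don't forget the last block
    records.append(current)

df = pd.DataFrame(records)     # fields a user lacks become NaN
```

For the real file, replace text.splitlines() with open(path) and iterate over the file object; a trailing bare [User] with no fields after it is ignored.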

How to concatenate sum on apply function and print dataframe as a table format within a file

I am trying to concatenate the 'count' value into the top row of my dataframe.
Here is an example of my starting data:
Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5
df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
df_new = df.groupby(['Name', 'IP'])['Count'].apply(lambda x:x.astype(int).sum())
If I print df_new this produces the following output:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
................Excel,15
Fred,200.200.200,MsWord,6
................Python,6
As you can see, the count has correctly been calculated, for Tom it has added 5 to 10 and got an output of 15. However, this is displayed on every row of the group.
Is there any way to get the output as follows - so the count is only on the first line of the group:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
.................Excel
Fred,200.200.200,MsWord,6
.................Python
Is there any way to write df_new to a file in this nice format?
I would like the output to appear like a table and almost look like an excel sheet with merged cells.
I have tried df_new.to_csv('path'), but this removes the nice formatting I see when I output df_new to the console.
It is a bit of a challenge to make a DataFrame provide summary rows. Generally, the DataFrame lends itself to results that are not dependent on position, such as the last item in a group. It can be done, but it is better to separate those concerns.
import pandas as pd
from io import StringIO  # Python 3 (was `from StringIO import StringIO` in Python 2)
data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")
#df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
#df_new = df.groupby(['Name', 'IP', 'Application'])['Count'].apply(lambda x:x.astype(int).sum())
df = pd.read_csv(data)
new_df = df.groupby(['Name', 'IP']).sum()
# reset the two levels of columns resulting from the groupby()
new_df.reset_index(inplace=True)
df.set_index(['Name', 'IP'], inplace=True)
new_df.set_index(['Name', 'IP'], inplace=True)
print(df)
Application Count
Name IP
Tom 100.100.100 MsWord 5
100.100.100 Excel 10
Fred 200.200.200 Python 1
200.200.200 MsWord 5
print(new_df)
Count
Name IP
Fred 200.200.200 6
Tom 100.100.100 15
print(new_df.join(df, lsuffix='_lsuffix', rsuffix='_rsuffix'))
Count_lsuffix Application Count_rsuffix
Name IP
Fred 200.200.200 6 Python 1
200.200.200 6 MsWord 5
Tom 100.100.100 15 MsWord 5
100.100.100 15 Excel 10
From here, you can use the multiindex to access the sum of the groups.
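The "count only on the first line of the group" layout can also be approximated directly in plain text. A sketch: put the group total on every row with transform, blank it out on repeated rows, then rely on to_string, which sparsifies a MultiIndex by default and gives the merged-cell look:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Tom", "Fred", "Fred"],
    "IP": ["100.100.100", "100.100.100", "200.200.200", "200.200.200"],
    "Application": ["MsWord", "Excel", "Python", "MsWord"],
    "Count": [5, 10, 1, 5],
})

# Group total on every row ...
df["Count"] = df.groupby(["Name", "IP"])["Count"].transform("sum")
# ... then blank it out everywhere except the first row of each group
first_row = ~df.duplicated(subset=["Name", "IP"])
df["Count"] = df["Count"].where(first_row, "")

# to_string repeats Name/IP only once per group (sparsified MultiIndex),
# which reads like merged cells in a plain-text file
report = df.set_index(["Name", "IP"]).to_string()
```

Writing report with open("report.txt", "w").write(report) keeps the console-style layout that to_csv would flatten.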

How can I create a data frame from a list of lists with different lengths in Python?

I am using PySpark (Python 3, Spark 2.1.0) and I have a list of different lists, such as:
lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
This list has elements of different lengths. Now I would like to create a DataFrame from this list, where the columns are the first attribute (i.e. FILE, NAME, SURNAME, BIRTHDATE, NATIONALITY) and the data is the second attribute.
As you can see, the second list does not have the 'BIRTHDATE' column; I need the DataFrame to fill that place with NaN or white space.
Also, I need DataFrame to be like this:
FILE NAME SURNAME BIRTHDATE NATIONALITY
----------------------------------------------------
123.xml ANA LÓPEZ 05-05-2000 ESP
458.xml JUAN PÉREZ NaN ESP
789.xml PEDRO CASTRO 07-07-2007 ESP
The data from the lists has to end up in the same columns.
I have written this code, but the result doesn't look like the table I'd like:
dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
d = dictOfWords
tabla = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dictOfWords.items() ]))
tabla_final = tabla.transpose()
tabla_final
Also, I have done this:
dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
print(dictOfWords)
tabla = pd.DataFrame.from_dict(dictOfWords, orient='index')
tabla
And the result is not good.
I would like a pandas DataFrame and a Spark DataFrame if it is possible.
Thanks!!
The following should work in your case:
In [5]: lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
...: ['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
...: ['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
...: ['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
In [6]: pd.DataFrame(list(map(dict, lista_archivos)))
Out[6]:
BIRTHDATE FILE NAME NATIONALITY SURNAME
0 05-05-2000 123.xml ANA ESP LÓPEZ
1 NaN 458.xml JUAN ESP PÉREZ
2 07-07-2007 789.xml PEDRO ESP CASTRO
Essentially, you convert your sublists to dict objects, and feed a list of those to the data-frame constructor. The data-frame constructor works with list-of-dicts very naturally.
