I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that gives, for each element, the distance to the most recent zero (counting as if there were a zero just before the first element). That is, I would like the following output:
[1,2,0,1,2,3,4,0,1,2]
How can I do this in pandas in the most efficient way?
score:0
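Another option:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
# positions of the zeros, with a virtual zero at index -1
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    # smallest non-negative offset = distance to the most recent zero
    return np.min(a[a >= 0])
df.index.to_series().apply(lambda i: d0(i - zeros))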
Or using pure numpy:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
# rows are positions, columns are offsets to each zero (negative means the zero lies ahead)
a = np.arange(len(df))[:, None] - np.r_[-1, np.where(df.X == 0)[0]][None]
np.min(a, where=a >= 0, axis=1, initial=len(df))
score:1
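It's sometimes surprising to see how simple it is to get C-like speeds for this stuff using Cython. Assuming your column's .values gives a 1-D int64 array arr, then:
cdef long long[:] arr_view = arr  # 1-D typed memoryview over the input
ret = np.zeros_like(arr)
cdef long long[:] ret_view = ret  # 1-D typed memoryview over the output
cdef Py_ssize_t i
cdef long long zero_count = 0
for i in range(arr_view.shape[0]):
    # reset at a zero, otherwise count one more step since the last zero
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count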
Note the use of typed memory views, which are extremely fast. You can speed it up further by wrapping this in a function decorated with @cython.boundscheck(False).
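For readers without a Cython toolchain, the same loop can be sketched in plain Python to check the logic before compiling. This is only a reference version: the function name distance_to_last_zero is made up for illustration, and it will be far slower than the typed-memoryview loop above.

```python
import numpy as np

def distance_to_last_zero(arr):
    """Plain-Python reference for the counting loop (hypothetical helper name)."""
    arr = np.asarray(arr)
    ret = np.zeros_like(arr)
    zero_count = 0  # behaves as if a zero sat just before the first element
    for i in range(len(arr)):
        # reset at a zero, otherwise count one more step since the last zero
        zero_count = 0 if arr[i] == 0 else zero_count + 1
        ret[i] = zero_count
    return ret

result = distance_to_last_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
print(result.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```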
score:1
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
# keep the running count only at the zeros (NaN elsewhere), then carry it forward
df['offset'] = df['cumsum'].where(df['flag'] == 0)
df['offset'] = df['offset'].ffill().fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']
df
score:4
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
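A possibly simpler variant of the same grouping idea, offered as a sketch rather than as the answer's own method: start a new block at every zero and number the rows within each block with cumcount. The names block and dist are illustrative. The block before the first zero has no zero at its head, so it is shifted by one to match the expected output.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# Each zero starts a new block; block 0 holds the rows before the first zero.
block = (s == 0).cumsum()

# Position of each row within its block: 0 at the zero itself, then 1, 2, ...
dist = s.groupby(block).cumcount()

# Rows before the first zero count from a virtual zero at index -1.
dist = dist + (block == 0).astype(int)

print(dist.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```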
score:7
The complexity is O(n). What slows it down is running a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, np.flatnonzero(ts.to_numpy() == 0)]  # indices of zeros, plus a virtual zero at -1
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
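If a vectorized O(n) version is wanted, one alternative (a different technique from the searchsorted approach above) is a running maximum over the zero positions. A minimal sketch, assuming the input converts to a 1-D NumPy array; the function name is hypothetical.

```python
import numpy as np

def distance_to_last_zero_accumulate(values):
    """Distance to the most recent zero via a running maximum (hypothetical helper)."""
    arr = np.asarray(values)
    idx = np.arange(len(arr))
    # Index of each zero, and -1 elsewhere (a virtual zero before the start).
    zero_pos = np.where(arr == 0, idx, -1)
    # Running maximum gives the index of the most recent zero at every position.
    last_zero = np.maximum.accumulate(zero_pos)
    return idx - last_zero

out = distance_to_last_zero_accumulate([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
print(out.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

With a pandas Series ts, this could be called as distance_to_last_zero_accumulate(ts.to_numpy()).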
Credit To: stackoverflow.com